shithub: mc

ref: d811a3c7a8fab7269e0798ee884d4d1363d913df
parent: 6368976a9e8e4121c5867ec81842e1d4bac55351
author: Ori Bernstein <ori@eigenstate.org>
date: Tue Jan 31 20:06:24 EST 2017

Not worth the work of finishing or maintaining the compiler doc.

--- a/doc/compiler.txt
+++ /dev/null
@@ -1,336 +1,0 @@
-                Structure of the Myrddin Compiler
-                            Aug 2012
-                          Ori Bernstein
-
-TABLE OF CONTENTS:
-
-    1. OVERVIEW
-        1.1. Tree Structure
-    2. PARSING
-        2.1. Lexing
-        2.2. AST Creation
-        2.3. Type checking
-        2.4. Generic Specialization
-        2.5. Serialization
-        2.6. Usefiles
-    3. FLATTENING
-        3.1. Control Flow
-        3.2. Complex Expressions
-    4. OPTIMIZATION
-        4.1. Constant Folding
-    5. CODE GENERATION
-        5.1. Instruction Selection
-        5.2. Register Allocation
-    6. TUTORIAL: ADDING A STATEMENT
-        6.1. Stubbing in the node types
-        6.2. Parsing
-        6.3. Flattening
-        6.4. Optimization
-        6.5. Instruction Selection
-
-1. OVERVIEW:
-
-    The Myrddin compiler suite consists of a set of binaries, written in C,
-    which translate Myrddin source code to the assembly most appropriate for
-    the target platform, and subsequently invoke the native assembler on it.
-    The linker is not invoked by the compiler, and the final output is an
-    object file for the target platform.
-
-    The compilers are named with a single character for the target platform,
-    followed by a single character for the language being compiled. A table
-    of the compilers and their names is below:
-
-        Compiler        Platform
-        -------------------------
-        6m              x86-64
-
-
-    The compilation is divided into a small number of phases. The first phase
-    is parsing, where the source code is tokenized, the abstract syntax tree
-    (AST) is built, and the tree is semantically checked. The second phase is
-    the machine-dependent tree flattening. In this phase, the tree is
-    decomposed function by function into simple operations that are relatively
-    close to the machine. Sizes are fixed, and all loops, if statements, etc.
-    are replaced with gotos. The next phase is a machine-independent optimizer,
-    which currently does nothing beyond constant folding on the trees. In the
-    final phase, the instructions are selected and the registers are allocated.
-
-    So, to recap, the phases are as follows:
-
-        parse           Tokenize, parse, and analyze the source
-        flatten         Rewrite the complex nodes into simple ones
-        opt             Optimize the flattened source trees
-        gen             Generate the assembly code
-
-    1.1. Tree Structure:
-
-        File nodes (n->type == Nfile) represent the files being compiled. The current
-        node is held in a global variable called, unsurprisingly, 'file'. The
-        global symbol table, export table, uses, and other compilation-specific
-        information are stored in this node. This implies that the compiler can
-        only process one file at a time.
-
-        Name nodes (n->type == Nname) are simply names, possibly with a namespace
-        attached. They are left as leaf nodes in the tree, specifying variable
-        names, union tags, and just about anything else with a name.
-
-        Use nodes (n->type == Nuse) simply tell the compiler that a usefile by the
-        name stored in this node will be loaded.
-
-        Expression nodes (n->type == Nexpr) represent expressions. They consist of
-        an operator, a type, a few flags, possibly a declaration ID, and a list of
-        arguments.
-        
-        Operators are defined in parse/ops.def, and start with an 'O' by
-        convention; eg: Oadd, Osub, etc.
-        
-        The declaration id (n->expr.did) is only valid on expressions representing
-        a single variable (n->expr.op == Ovar). The DID is a unique identifier
-        representing the declaration node that the variable refers to. This is used
-        for a variety of things, from fast comparisons to allowing us to put the
-        node into a bit set easily.
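-
-        As an illustrative sketch (the helper names and fields below are
-        assumptions, not the compiler's actual API), a pass that wants to
-        track which variables it has seen might use the DID as a bitset
-        index:
-
-            /* Sketch: mark every variable mentioned in an expression tree.
-               Node, Nexpr, Ovar and n->expr.did are as described above;
-               Bitset, bsput(), args and nargs are assumed helpers/fields. */
-            void
-            markvars(Node *n, Bitset *seen)
-            {
-                size_t i;
-
-                if (n->type != Nexpr)
-                    return;
-                if (n->expr.op == Ovar)
-                    bsput(seen, n->expr.did);
-                for (i = 0; i < n->expr.nargs; i++)
-                    markvars(n->expr.args[i], seen);
-            }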
-
-        Literal nodes (n->type == Nlit) hold a literal value. The kind of
-        literal held is stored in littype; the literal kinds are defined in
-        parse/lits.def.
-
-        The various statement nodes (Nloopstmt, Nifstmt, Nmatchstmt, Nblock,
-        Nlbl) are all statements that may appear within a block node (Nblock).
-
-        Declaration nodes declare a name in a symbol table. TODO: MORE DETAIL.
-
-        Uelt nodes declare a union element. TODO: MORE DETAIL.
-
-        Func nodes declare a function. TODO: MORE DETAIL.
-        
-
-
-2. PARSING:
-
-    This phase takes in a source file, and outputs a tree that is guaranteed
-    to be valid. The tree nodes are defined in parse/parse.h in struct Node,
-    and have one of the types defined in parse/nodetypes.def. Node types
-    start with 'N' by convention; eg: Nfile, Nifstmt, etc.
-
-    2.1. Lexing:
-        
-        Lexing occurs in parse/tok.c. Because we want to use this lexer from
-        within yacc, the entry point to this code is in 'yylex()'. As required
-        by yacc, 'yylex()' returns an integer defining the token type, and
-        sets the 'tok' member of yylval to the token that was taken from the
-        input stream. In addition, to allow for better error messages, the
-        global variable 'curtok' is set to the value of 'yylval.tok'. This
-        allows yyerror to print the last token that was seen.
-
-        The tokens that are allowable are generated by Yacc from the '%token'
-        definitions in parse/gram.y, and are placed into the file
-        'parse/gram.h'. The lexer and parser code is the only code that
-        depends on these token constants.
-
-        The lexer is initialized through 'tokinit(char *file)'. This function
-        will open the file passed in, read all the data from it in one go
-        and set up the internal data for the tokenizer. The tokenizing is then
-        done while the whole file is in memory, which means that this code
-        will work poorly on files that are larger than the address space
-        available to the compiler. If this is a problem, you deserve all the
-        pain that is caused.
-
-        The file data is stored in the three global variables 'fidx', 'fbuf',
-        and 'fbufsz'. The actual tokenization happens through 'toknext()' and
-        its callees, which operate on these data structures character by
-        character, matching the values read, and shoving them into the 'Tok'
-        data structure.
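-
-        As a sketch of how this interface hangs together, a standalone token
-        dumper could be written roughly as follows (this assumes the usual
-        yacc convention that yylex() returns 0 at end of input):
-
-            /* Sketch: drive the lexer by hand, outside of yacc.
-               Assumes the usual parse/ headers for Tok, yylval, etc. */
-            #include <stdio.h>
-
-            int
-            main(int argc, char **argv)
-            {
-                int t;
-
-                tokinit(argv[1]);
-                /* yacc-style lexers return 0 at end of input */
-                while ((t = yylex()) != 0)
-                    printf("token: %d\n", t);
-                return 0;
-            }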
-
-    2.2. AST Creation:
-
-        The parser used is a traditional Yacc-based parser. It is generated
-        from the source in parse/gram.y. The starting production is 'file',
-        which fills in a global 'file' tree node. This 'file' tree node must
-        be initialized before yyparse() is called.
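-
-        In other words, a caller is expected to do roughly the following
-        (mkfile() is a stand-in name for whatever allocates the Nfile node;
-        the exact initializer is not important here):
-
-            /* Sketch: set up the globals, then hand control to yacc. */
-            tokinit(fname);
-            file = mkfile(fname);   /* the global Nfile node */
-            yyparse();              /* fills in 'file' as it parses */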
-
-
-    2.3. Type Checking:
-
-        Type checking is done through unification of types. It's implemented
-        in parse/infer.c. It proceeds through a simple unification algorithm,
-        which is documented in lang.txt. As a result, only the internal
-        details of this algorithm will be discussed here.
-
-        The first step done is loading and resolving use files. This is
-        deferred to the type checking phase for two reasons. First, we
-        do not want to force tools to have all dependencies compiled if they
-        use this parser, even though full type checking is impossible
-        until all usefiles are loaded. And second, this is when the
-        information is actually needed.
-
-        Next, the types declared in the package section are merged with the
-        exported types, allowing us to start off with our type information as
-        complete as possible, and making sure that the types of globals
-        actually match up with the exported types.
-        
-        The next step is the actual type inference. We do a bottom-up walk of
-        the tree, unifying types as we go. There are subtleties with the
-        member operator, however. Because the '.' operator is used for both
-        member lookups and namespace lookups, before we descend into a node
-        that has operator Omemb, we need to check if it's a namespaced name,
-        or an actual member reference. If it is a namespaced name, we replace
-        the expression with an Ovar expression. This check happens in the 
-        'checkns()' function. Second, because we need to know the LHS of a
-        member expression before we can check if the RHS is valid, and we
-        are not guaranteed to know this at the first time that we see it, the
-        expression is assumed to be valid, and this assumption is verified in
-        a post-processing pass. Casts are similarly validated in a deferred
-        manner.
-
-        Generic variables are added to a list of generic callsites to
-        specialize when they are seen as the leaf of an Ovar node.
-
-        The type inference, to this point, has only built up a mapping
-        of types. So, for example, after inferring types for the following
-        set of statements:
-
-            var a
-            var b
-            var c
-            a = b
-            c = b + 1
-
-        We would have the mappings:
-
-            $t0 -> $t1
-            $t1 -> $t2
-            $t2 -> int
-
-        So, in the 'typesub()' function, we iterate over the entire tree,
-        replacing every instance of a non-concrete type with the final
-        mapped type. If a type does not map to a fully concrete type,
-        this is where we flag an error.
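-
-        Conceptually, resolving a type just chases that mapping until it
-        stops changing. A minimal sketch of the idea (Tyvar, tid and tytab
-        are stand-ins for the actual structures in parse/infer.c):
-
-            /* Sketch: follow the substitution map to its final type. */
-            Type *
-            tyfin(Type *t)
-            {
-                while (t->type == Tyvar && tytab[t->tid] != NULL)
-                    t = tytab[t->tid];
-                return t;   /* still a Tyvar here => not fully concrete */
-            }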
-
-        FIXME: DESCRIBE HOW YOU FIXED GENERICS ONCE YOU FIX GENERICS.
-
-    2.4. Generic Specialization:
-
-        After type inference (well, technically, as the final step of it),
-        we walk through the list of callsites that need instantiations
-        of generics, and create a specialized generic instance for each of
-        them. This specialization is done, unsurprisingly, in specialize.c,
-        by the simple algorithm of cloning the entire tree that needs to
-        be specialized, and walking over all nodes, substituting the
-        concrete types for the type parameters.
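-
-        The substitution itself is a plain recursive walk; a sketch of the
-        shape of it (tysubst(), Htab and the field names are illustrative,
-        not the actual code):
-
-            /* Sketch: rewrite the types in a freshly cloned tree, mapping
-               type parameters to the concrete types bound at the callsite. */
-            void
-            specializenode(Node *n, Htab *bindings)
-            {
-                size_t i;
-
-                if (n == NULL)
-                    return;
-                if (n->type == Nexpr) {
-                    n->expr.type = tysubst(n->expr.type, bindings);
-                    for (i = 0; i < n->expr.nargs; i++)
-                        specializenode(n->expr.args[i], bindings);
-                }
-                /* ...and similarly for the other node types... */
-            }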
-
-    2.5. Serialization:
-
-        Trees of all sorts can be serialized and deserialized from files,
-        as long as they are fully typed. Trees containing type variables (ie,
-        uninferred types) cannot be serialized, as type variables cannot be
-        deserialized meaningfully.
-
-        The format for this is only documented in the source, and is a
-        straightforward dump of the trees to memory. It is constantly shifting,
-        and will not reliably work between compiler versions.
-
-    2.6. Usefiles:
-
-        Usefiles are more or less files that consist of a single character tag
-        that tells us what type of tree to deserialize.  Because serialized
-        trees are compiler version dependent, so are usefiles.
-
-3. FLATTENING:
-
-    This phase is invoked repeatedly on each top-level declaration that we
-    want to generate code for. There is a good chance that this flattening
-    phase should be made machine-independent, taking as a parameter a
-    machine description that gives known integer and pointer sizes, among
-    other machine attributes. However, for now, it is machine-dependent,
-    and lives in 6/simp.c.
-
-    The goal of flattening a tree is to take semantically involved constructs
-    such as looping, and simplify things into something that is easy to
-    generate code for, as well as something that is easier to analyze for
-    optimization.
-
-    3.1. Control Flow:
-
-        All if statements, loops, and other complex constructs are simplified
-        to jumps and conditional jumps. Loops are generally simplified from
-        something that would look like this:
-
-            loop
-                init
-                cond
-                inc
-                body
-
-        To something that would look like this:
-
-                init
-                jmp cond
-            .loop:
-                body
-                inc
-            .cond:
-                cjmp cond .loop .end
-            .end:
-
-        Boolean expressions are compiled into jump code: rather than
-        producing a value, they transfer control to a true or false label,
-        as described in section 8.4 of the Dragon book[1].
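-
-        For example, an if statement guarded by 'a && b' turns into a chain
-        of conditional jumps rather than a computed boolean value, roughly:
-
-            if a && b
-                body
-            else
-                elsebody
-
-        becomes:
-
-                cjmp a .rhs .else
-            .rhs:
-                cjmp b .body .else
-            .body:
-                body
-                jmp .end
-            .else:
-                elsebody
-            .end: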
-
-    3.2. Complex Expressions:
-
-        Complex expressions such as copying types larger than a single machine
-        word, pulling members out of structures, emulating multiplication and
-        division for larger integer sizes, and similar operations are reduced
-        to trees that are expressible in terms of simple machine operations.
-
-        By the end of the simplification pass, the following operators should
-        not be present in the trees:
-
-            Obad Oret Opreinc Opostinc Opredec Opostdec Olor Oland Oaddeq
-            Osubeq Omuleq Odiveq Omodeq Oboreq Obandeq Obxoreq Obsleq
-            Obsreq Omemb Oslice Oidx Osize Numops Oucon Ouget Otup Oarr
-            Oslbase Osllen Ocast
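-
-        As a rough illustration of the kind of rewrite involved, indexing a
-        slice, written 'a[i]', is reduced to explicit address arithmetic on
-        the slice's base pointer, conceptually:
-
-            x = a[i]
-
-        becomes something along the lines of:
-
-            x = *(base(a) + i*sizeof(elt))
-
-        where base(a) and sizeof(elt) stand for the slice's base pointer and
-        element size; the actual trees use the simple load, add, and multiply
-        operations that survive this pass.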
-
-
-4. OPTIMIZATION:
-
-    Currently, there is virtually no optimization done on the trees after
-    flattening. The only optimization that is done is constant folding.
-
-    4.1. Constant Folding:
-
-        Expressions with constant values are simplified algebraically. For
-        example, the expression 'x*1' is simplified to 'x', '0/n' is
-        simplified to '0', and so on.
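-
-        A minimal sketch of the idea, assuming an illustrative helper
-        islit() and hypothetical field layout (the real folding is part of
-        the opt pass):
-
-            /* Sketch: algebraic simplifications like those described above. */
-            Node *
-            fold1(Node *n)
-            {
-                Node *l, *r;
-
-                l = n->expr.args[0];
-                r = n->expr.args[1];
-                switch (n->expr.op) {
-                case Omul:
-                    if (islit(r, 1))    /* x*1 => x */
-                        return l;
-                    break;
-                case Odiv:
-                    if (islit(l, 0))    /* 0/n => 0 */
-                        return l;
-                    break;
-                default:
-                    break;
-                }
-                return n;
-            }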
-
-
-5. CODE GENERATION:
-
-    5.1. Instruction Selection:
-
-        Instruction selection is done via a simple handwritten bottom-up pass
-        over the tree. Common patterns such as scaled or offset indexing are
-        recognized and matched specially, but no attempt is made at finding
-        an optimal tiling.
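-
-        As an illustrative example of the sort of pattern involved (the
-        register names below are placeholders), a load of
-
-            *(base + idx*8)
-
-        can be matched onto a single x86-64 scaled-index addressing mode:
-
-            movq (%base,%idx,8), %dst
-
-        rather than computing the address with separate shift and add
-        instructions.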
-
-    5.2. Register Allocation:
-
-        Register allocation is done via the algorithm described in "Iterated
-        Register Coalescing" by Appel and George. As of the time of this
-        writing, the register allocator does not yet implement overlapping
-        register classes. This will be done as described in "A generalized
-        algorithm for graph-coloring register allocation" by Smith, Ramsey,
-        and Holloway.
-
-6. TUTORIAL: ADDING A STATEMENT:
-
-    6.1. Stubbing in the node types:
-
-    6.2. Parsing:
-    
-    6.3. Flattening:
-
-    6.4. Optimization:
-
-    6.5. Instruction Selection:
-
-[1] Aho, Sethi, Ullman: Compilers: Principles, Techniques, and Tools, 1988.
-        ISBN 0-201-10088-6