Is there any programmer who hasn't read the "Dragon Book,"
the classic text on compiler design by Aho, Sethi, and Ullman? Or tried
to read it? Or at least bought it, and placed it in a prominent position
on their bookshelf? Certainly, you've at least seen the book: the cover
(by Jean Depoian) shows a dragon, representing "Complexity of Compiler
Design," about to be slain by a knight with "Data Flow Analysis"
armor, a "Syntax Directed Translation" shield, and a "LALR
Parser Generator" sword.
You haven't tried to read the Dragon Book? Well, you should. Like every book out of AT&T Bell Labs, it is beautifully written and organized. It is one of those few books that can make you pleased to be involved in software, because while reading the book you come to realize that our field--the messiness of daily practice aside--really does have a solid mathematical foundation. But in my more honest moments, I have to admit that, try as I might, I've never really understood much in the Dragon Book past about page 200 (at least that's where the underlining stops in my copy). The problem, for me at least, is that it's one thing to read about a subject such as the equivalence of languages and machines (a strikingly beautiful discovery) and another thing to really see this equivalence.
For me, and I suspect for a lot of programmers, the only way to illustrate a topic like this is with some code. You say that regular expressions and finite-state machines are equivalent? I can appreciate that this is an important result, but to understand it I need to see some C code that converts a regular expression into a two-dimensional array that forms the program for one of these machines.
This is where Allen Holub's book, Compiler Design in C, comes in. If you have ever wanted to understand how your favorite compiler works, or if you have ever needed to write some form of language processor (perhaps as simple as a text-pattern searcher, or the macro or script language for a product), you will want Holub's book. It is approachable by programmers in a way that the Dragon Book just isn't.
Several good, readily understandable books on compiler design have been available for years. Hendrix's A Small C Compiler (M&T Publishing, 1990), which comes with complete C source code for an 8088-based C compiler, is the best example. You walk away from Hendrix's book seeing exactly how his Small C compiler is put together and feeling that you could "do it yourself."
But real compilers, such as Borland C++ or Microsoft C, aren't built using the readily understandable, recursive-descent technique that Hendrix puts to such good use. These compilers are built, in an initially very nonintuitive way, using compiler-compiler tools such as LEX and YACC. LEX takes a set of regular expressions and associated C code, and turns them into the function yylex(), which tokenizes input (such as your .C file). The C code is essentially "event driven:" It is invoked whenever its associated pattern is recognized in the input. YACC takes a grammar and associated C code, and turns them into the function yyparse(), which parses the tokens generated for example by yylex(). The main() routine for a C compiler might consist of little more than a call to yyparse().
What Holub does in this massive book is amazing. He presents the complete C source code for a LEX clone, two different YACC clones (one of which, Llama, builds a top-down LL(1) parser, while the other, occs--the "other compiler-compiler system" -- builds a bottom-up, LALR(1) parser like that built by (YACC), and a C compiler. The C compiler's source code consists largely, not of .C files, but of a LEX file (c.lex) and a YACC file (c.y). The .C files do symbol-table management, manipulation of lvalues and rvalues, code generation, arithmetic operations, and the like.
DDJ readers may know Holub as this magazine's former C columnist. He brings to Compiler Design in C the same attention to detail found in his earlier The C Companion and On Command (a detailed presentation of the C source code for a command interpreter). Above all, Holub excels at showing how things work. This has always been my favorite kind of reading, escape reading almost. Such "how it works" writing is quite different from "how to" writing. The pleasure is not so much a "Hey, I could do that!" feeling (I know I couldn't), but rather that of seeing a little bit of how the software I use every day actually does its stuff. What's Borland C++ doing when it crunches through my .C files? Having worked through Holub's book, I have a much better idea.
The source code itself is presented in an almost ideal way, with each .C file broken up by just the right amount of text. The order in which code is presented is crucial in a book like this, and Holub has made good use of Arachne, a C preprocessor he wrote. Arachne is a version of Knuth's WEB system (which I discussed in the August 1992 DDJ), allowing source code and documentation to be put together in a single input file. Arachne is itself a compiler and, as Holub notes, it stands as "an example of how you can apply the techniques presented in this book to applications other than writing compilers for standard programming languages." Compiler design is a very general-purpose programming skill.
The chief advantage of Holub's book over the other books on this subject is clearly its skillful presentation of a large amount of C code. But Holub does not just fling gobs of source code at you, as so many programming books do. Many authors discuss the implementation of their programs in loving detail, but forget to describe what the program does, or what output it produces. Revealing the workings of a machine, without disclosing what the machine does, is a classic symptom of engineer's disease. Holub's book doesn't suffer from this at all. He walks the reader through each stage of the compiler process, always remembering to point out what the final goal is, what the program's output will look like, and so on. The C code generated by LEX, Llama, and occs is nicely commented and essentially self-documenting. He constantly shows the input to the program, its output, and exactly how the program turned the former into the latter.
Often, Holub shows the same thing from multiple angles. For example, he presents a regular expression, a hand-written recognizer for that expression, a diagram of the equivalent finite-state machine (FSM) to recognize that expression, a two-dimensional state table representing the machine, and finally the C version of the machine, both compressed and uncompressed, with the driver function that uses the tables. Holub walks carefully through the whole process, showing how a LEX tokenizer can take text such as " 1 + 2 * 3" and turn it into a stream of tokens (input for a Yacc grammar) such as NUM PLUS NUM STAR NUM.
One of the points that becomes clear as you read this book is the tremendous generality of the finite-state machine as a way of programming. In essence, an FSM consists of a two-dimensional table "next," with the rows holding states and the columns holding input. A generic "driver" steps through the table; see Example 1.
The goal of programs like LEX and YACC is to produce the table "next." The driver, which is the while (state = next(state, input)) loop, is generic and changes little between programs. In other words, the "next" table really is a program for a virtual machine.
Since LEX and YACC are themselves just compilers (they compile regular expressions or grammars with associated C code into state tables), it's interesting to see how they themselves are written. Holub builds LEX by hand, using recursive descent. Once you have LEX, you can write LEX.LEX, which would be a table-driven lexical analyzer. Holub leaves this as one of many excellent exercises in the book. (I'm waiting for The Compiler Design in C Answer Book!) With the Llama parser, Holub takes a classic bootstrap approach: First, he uses a LEX file (with plenty of C actions) to implement a scaled-down version of Llama; then he feeds the scaled-down version of Llama with a Llama input file that creates a Llama parser, "like a snake eating its tail."
Even in a book close to a thousand pages long, many topics can't be covered in such complete detail. Many efficiency considerations are avoided by the code, though the text contains good pointers to the relevant literature. The chapter on optimization is a good introduction to this subject, though the C compiler itself is nonoptimizing. Many programmers are interested in compiler optimizations (this, after all, is what compiler benchmarks measure, for want of anything better), but popular topics such as peephole optimization and common-subexpression elimination probably can't be fully understood without first digesting the material in Holub's book.
The code generated by the C compiler is, oddly enough, not 80x86 or 680x0 assembler, much less binary object code, but something called C-Code. This is an assembly language-like form of C. (For example, there are no if ,while, or for statements; everything is reduced to its ultimate goto form.) While it would perhaps have been more satisfying to have the compiler generate actual assembler, C-Code is just as good for this book's purpose.
A disk is not included with the book, but is available separately from the author for $60. I had the usual problems of getting the subdirectories and make files straight, but these seem unavoidable when working with large quantities of someone else's source code, even source code as well commented and well organized as this. In any case, ready-to-run DOS.EXEs are provided. The highlight of this package is a fascinating full-screen "visible" C compiler that shows the compiler's operations. I do wish some screen shots of this program had been included with the book, perhaps instead of the over 50 pages on curses. The full-screen C compiler -- actually it's occs whose operations can be made full-screen, and the C compiler just inherits this like any other occs program -- is built using curses, for which Holub provides a really fairly irrelevant (in this context) description.
Even if you have the commercially supported MKS LEX/YACC, or Gnu FLEX/BISON, you will want to get Holub's book to see how these programs--or at least very similar programs--actually work. The Gnu software comes with source code, but I wouldn't want to tackle it without first having read Holub's book. I had a strange experience after finishing this book. In preparation for this review, I took my copy of the Dragon Book down from the shelf yet again, just to see if it really is such hard going. Oddly enough, I found that I actually understand a lot of it past page 200. Clearly this is because I've worked through Holub's book. If you, too, have been frustrated trying to read the Dragon Book, first read Compiler Design in C (whose cover, incidentally, shows four mice operating some sort of cheese-grater/mailbox contraption). Then try to slay the dragon.
STATE next[NUM_STATES][NUM_INPUTS]; while (state = next(state, input)) if (state == ACCEPT) brake; else do_action(state);