Is there any programmer who hasn't read the "Dragon Book,"
the classic text on compiler design by Aho, Sethi, and Ullman? Or tried
to read it? Or at least bought it, and placed it in a prominent position
on their bookshelf? Certainly, you've at least seen the book: the cover
(by Jean Depoian) shows a dragon, representing "Complexity of Compiler
Design," about to be slain by a knight with "Data Flow Analysis"
armor, a "Syntax Directed Translation" shield, and a "LALR
Parser Generator" sword.
You haven't tried to read the Dragon Book? Well, you should. Like every
book out of AT&T Bell Labs, it is beautifully written and organized.
It is one of those few books that can make you pleased to be involved in
software, because while reading the book you come to realize that our field--the
messiness of daily practice aside--really does have a solid mathematical
foundation. But in my more honest moments, I have to admit that, try as
I might, I've never really understood much in the Dragon Book past about
page 200 (at least that's where the underlining stops in my copy). The problem,
for me at least, is that it's one thing to read about a subject such as
the equivalence of languages and machines (a strikingly beautiful discovery)
and another thing to really see this equivalence.
For me, and I suspect for a lot of programmers, the only way to illustrate
a topic like this is with some code. You say that regular expressions and
finite-state machines are equivalent? I can appreciate that this is an important
result, but to understand it I need to see some C code that converts a regular
expression into a two-dimensional array that forms the program for one of
these machines.
This is where Allen Holub's book, Compiler Design in C, comes in.
If you have ever wanted to understand how your favorite compiler works,
or if you have ever needed to write some form of language processor (perhaps
as simple as a text-pattern searcher, or the macro or script language for
a product), you will want Holub's book. It is approachable by programmers
in a way that the Dragon Book just isn't.
Several good, readily understandable books on compiler design have been
available for years. Hendrix's A Small C Compiler (M&T Publishing,
1990), which comes with complete C source code for an 8088-based C compiler,
is the best example. You walk away from Hendrix's book seeing exactly how
his Small C compiler is put together and feeling that you could "do
it yourself."
But real compilers, such as Borland C++ or Microsoft C, aren't built using
the readily understandable, recursive-descent technique that Hendrix puts
to such good use. These compilers are built, in an initially very nonintuitive
way, using compiler-compiler tools such as LEX and YACC. LEX takes a set
of regular expressions and associated C code, and turns them into the function
yylex(), which tokenizes input (such as your .C file). The C code is essentially
"event driven:" It is invoked whenever its associated pattern
is recognized in the input. YACC takes a grammar and associated C code,
and turns them into the function yyparse(), which parses the tokens generated
for example by yylex(). The main() routine for a C compiler might consist
of little more than a call to yyparse().
What Holub does in this massive book is amazing. He presents the complete
C source code for a LEX clone, two different YACC clones (one of which,
Llama, builds a top-down LL(1) parser, while the other, occs--the "other
compiler-compiler system" -- builds a bottom-up, LALR(1) parser like
that built by (YACC), and a C compiler. The C compiler's source code consists
largely, not of .C files, but of a LEX file (c.lex) and a YACC file (c.y).
The .C files do symbol-table management, manipulation of lvalues and rvalues,
code generation, arithmetic operations, and the like.
DDJ readers may know Holub as this magazine's former C columnist. He brings
to Compiler Design in C the same attention to detail found in his
earlier The C Companion and On Command (a detailed presentation
of the C source code for a command interpreter). Above all, Holub excels
at showing how things work. This has always been my favorite kind of reading,
escape reading almost. Such "how it works" writing is quite different
from "how to" writing. The pleasure is not so much a "Hey,
I could do that!" feeling (I know I couldn't), but rather that of seeing
a little bit of how the software I use every day actually does its stuff.
What's Borland C++ doing when it crunches through my .C files? Having worked
through Holub's book, I have a much better idea.
The source code itself is presented in an almost ideal way, with each .C
file broken up by just the right amount of text. The order in which code
is presented is crucial in a book like this, and Holub has made good use
of Arachne, a C preprocessor he wrote. Arachne is a version of Knuth's WEB
system (which I discussed in the August 1992 DDJ), allowing source code
and documentation to be put together in a single input file. Arachne is
itself a compiler and, as Holub notes, it stands as "an example of
how you can apply the techniques presented in this book to applications
other than writing compilers for standard programming languages." Compiler
design is a very general-purpose programming skill.
The chief advantage of Holub's book over the other books on this subject
is clearly its skillful presentation of a large amount of C code. But Holub
does not just fling gobs of source code at you, as so many programming books
do. Many authors discuss the implementation of their programs in loving
detail, but forget to describe what the program does, or what output it
produces. Revealing the workings of a machine, without disclosing what the
machine does, is a classic symptom of engineer's disease. Holub's book doesn't
suffer from this at all. He walks the reader through each stage of the compiler
process, always remembering to point out what the final goal is, what the
program's output will look like, and so on. The C code generated by LEX,
Llama, and occs is nicely commented and essentially self-documenting. He
constantly shows the input to the program, its output, and exactly how the
program turned the former into the latter.
Often, Holub shows the same thing from multiple angles. For example, he
presents a regular expression, a hand-written recognizer for that expression,
a diagram of the equivalent finite-state machine (FSM) to recognize that
expression, a two-dimensional state table representing the machine, and
finally the C version of the machine, both compressed and uncompressed,
with the driver function that uses the tables. Holub walks carefully through
the whole process, showing how a LEX tokenizer can take text such as "
1 + 2 * 3" and turn it into a stream of tokens (input for a Yacc grammar)
such as NUM PLUS NUM STAR NUM.
One of the points that becomes clear as you read this book is the tremendous
generality of the finite-state machine as a way of programming. In essence,
an FSM consists of a two-dimensional table "next," with the rows
holding states and the columns holding input. A generic "driver"
steps through the table; see Example 1.
The goal of programs like LEX and YACC is to produce the table "next."
The driver, which is the while (state = next(state, input)) loop, is generic
and changes little between programs. In other words, the "next"
table really is a program for a virtual machine.
Since LEX and YACC are themselves just compilers (they compile regular expressions
or grammars with associated C code into state tables), it's interesting
to see how they themselves are written. Holub builds LEX by hand, using
recursive descent. Once you have LEX, you can write LEX.LEX, which would
be a table-driven lexical analyzer. Holub leaves this as one of many excellent
exercises in the book. (I'm waiting for The Compiler Design in C Answer
Book!) With the Llama parser, Holub takes a classic bootstrap approach:
First, he uses a LEX file (with plenty of C actions) to implement a scaled-down
version of Llama; then he feeds the scaled-down version of Llama with a
Llama input file that creates a Llama parser, "like a snake eating
its tail."
Even in a book close to a thousand pages long, many topics can't be covered
in such complete detail. Many efficiency considerations are avoided by the
code, though the text contains good pointers to the relevant literature.
The chapter on optimization is a good introduction to this subject, though
the C compiler itself is nonoptimizing. Many programmers are interested
in compiler optimizations (this, after all, is what compiler benchmarks
measure, for want of anything better), but popular topics such as peephole
optimization and common-subexpression elimination probably can't be fully
understood without first digesting the material in Holub's book.
The code generated by the C compiler is, oddly enough, not 80x86 or 680x0
assembler, much less binary object code, but something called C-Code. This
is an assembly language-like form of C. (For example, there are no if ,while,
or for statements; everything is reduced to its ultimate goto form.) While
it would perhaps have been more satisfying to have the compiler generate
actual assembler, C-Code is just as good for this book's purpose.
A disk is not included with the book, but is available separately from the
author for $60. I had the usual problems of getting the subdirectories and
make files straight, but these seem unavoidable when working with large
quantities of someone else's source code, even source code as well commented
and well organized as this. In any case, ready-to-run DOS.EXEs are provided.
The highlight of this package is a fascinating full-screen "visible"
C compiler that shows the compiler's operations. I do wish some screen shots
of this program had been included with the book, perhaps instead of the
over 50 pages on curses. The full-screen C compiler -- actually it's occs
whose operations can be made full-screen, and the C compiler just inherits
this like any other occs program -- is built using curses, for which Holub
provides a really fairly irrelevant (in this context) description.
Even if you have the commercially supported MKS LEX/YACC, or Gnu FLEX/BISON,
you will want to get Holub's book to see how these programs--or at least
very similar programs--actually work. The Gnu software comes with source
code, but I wouldn't want to tackle it without first having read Holub's
book. I had a strange experience after finishing this book. In preparation
for this review, I took my copy of the Dragon Book down from the shelf yet
again, just to see if it really is such hard going. Oddly enough, I found
that I actually understand a lot of it past page 200. Clearly this is because
I've worked through Holub's book. If you, too, have been frustrated trying
to read the Dragon Book, first read Compiler Design in C (whose cover,
incidentally, shows four mice operating some sort of cheese-grater/mailbox
contraption). Then try to slay the dragon.
STATE next[NUM_STATES][NUM_INPUTS]; while (state = next(state, input)) if (state == ACCEPT) brake; else do_action(state);