What is a Parser?

Welcome back to another exciting episode of the NaCreSoMo 2014 CS mini-tutorial. Today we’re going to pick up where we left off in episode 1 by building a parser that uses the output of our lexer. A parser is another core part of a compiler: it takes the “words” (tokens, remember?) that come out of the lexer and tries to understand what they mean. If the lexer is about turning letters into words, a parser is making sentences.

  1. What is a Lexer?
  2. What are State Machines?
  3. What are Regular Expressions?

Before we start coding (and for those who aren’t going to follow along with the code), let’s look at a concrete example of parsing that pretty much everyone is, well, forced to learn in elementary school: PEMDAS.

12 • 7 - 6 • 9 + 2

What is the answer?

If you said “704” or even “-708”, you’d be on the right track. But it’d be weird if you said, say, “32”.

Oh wait, sorry, I got that backwards.

If you had never seen our arithmetic notation before, all you really have is a series of numbers and symbols: 12, , 7, -, 6, , 9, +, 2. (This is basically what comes out of a lexer!) How do you go from that to something you can actually calculate? Just grouping up the operations from left-to-right gives you the wrong answer.

(((12 • 7) - 6) • 9) + 2 = 704

Okay, how about right-to-left?

12 • (7 - (6 • (9 + 2))) = -708

To correctly parse this arithmetic expression, you have to know something about the content. Unlike the lexer, which basically just classified characters and chopped things up, a parser knows what the resulting “sentence” is going to look like. It’s imposing a grammar on the lexer’s words. In the arithmetic case, the grammar says that multiplication takes precedence over addition and subtraction, and that you go left-to-right when dealing with the “same kind” of operation.1

((12 • 7) - (6 • 9)) + 2 = 32

There we go.

Of course, most programming languages aren’t arithmetic problems. Instead, they describe a format—a grammar—for a particular set of instructions. When the user puts in code that matches the grammar, the parser can interpret the “sentences” and “phrases” (known as productions), and build up some structure to represent the information. The rest of the program then uses this structured information to accomplish some task; an “interpreter” will treat it as instructions about what to do next, a “compiler” will use it as the description of a new program it can then write out to a separate file, and so on.

Here’s a snippet of BASIC code that reads in a number and prints its factorial.

20 LET F = 1
30 FOR I = 1 TO N
40 LET F = F * I
70 END

I chose BASIC, an old, messy, and somewhat broken language family because the syntax is fairly simple. Let’s look at the various patterns we see here:

  • Every line starts with a number.
  • INPUT” means “read from the user”. Whatever follows it is the name of the variable to read into.
  • LET” puts a value into a variable. It’s followed by a variable name, an equal sign, and then some kind of expression. (Here we have both “1” and “F * L”).
  • FOR” and “NEXT” go together – whatever’s between them is run several times. The “FOR” is followed by a name, an equal sign, a starting value, the word “TO”, and an ending value.
  • PRINT” shows something to the user. Whatever follows it gets printed.
  • END” ends the program. It doesn’t have anything following it.

So after we’ve parsed this program, we might have this sort of structure:

    • 10: Read input (INPUT)
      • variable name = “N”
    • 20: Set value (LET)
      • variable name = “F”
      • value = “1”
    • 30: Loop! (FOR)
      • variable name = “I”
      • starting value = “1”
      • ending value = “N”
    • 40: Set value (LET)
      • variable name = “F”
      • value = multiply
        • multiplicand = “F”
        • multiplier = “N”
    • 50: Start the loop over (NEXT)
    • 60: Print output (PRINT)
      • value = “F”
    • 70: End the program (END)

When the structure represents code, it’s often called an abstract syntax tree. It’s “abstract” because it doesn’t exactly reflect the actual text any more. It’s “syntax” because it’s still based on the grammar of the language. And it’s a “tree” because of the hierarchical structure.

So, what are we actually going to build today? How about we come up with a grammar that describes state machines?

  • We need a way to specify the states. We can write that “state 1”. (Or any other number.)
  • We might want to name states, so let’s allow state labels to be either numbers or quoted text (like "abc").
  • We need a way to say that a state is a final state. We can write that as “final state 2”, or just “final 2” because the “state” is kind of redundant.
  • We need a way to specify the start state. Let’s write that as “start state 3”, or just “start 3”.
  • Oh wait…it’s possible for the start state to also be a final state. “start final state 4”? Or just “start final 4”? Works for me!

Okay, that’s pretty good for states. How about transitions?

  • Let’s say that after a new state, you put all of its transitions.
  • Let’s put a colon after states, then, so that it’s clear that everything that comes after them is part of that section. “state 5: …”
  • If we just want to match specific text, let’s put it in quotes. "abc", '@', etc.
  • We should allow things like “any letter”, too.
  • If we want to exclude things from an “any” pattern, we should allow you to put them first and have them transition to “error”.
  • After we’ve matched something, we need to know which state to go to.
  • ”->” looks like an arrow, so let’s write transitions as “the part we want to match, then an arrow, then the label for the state we should go to”

Aaaand…that’s pretty much it! If this exercise felt contrived to you, well…that’s pretty much how all of CS works. We made it up.

Here’s how you could write a description of the e-mail machine from last time in this new “state machine language” we’ve made:

start state 1:
    "@" -> error
    any number -> 2
    any letter -> 2
    any symbol -> 2
state 2:
    "@" -> 3
    any number -> 2
    any letter -> 2
    any symbol -> 2
state 3:
    any number -> 4
    any letter -> 4
state 4:
    "." -> 5
    any number -> 4
    any letter -> 4
state 5:
    any number -> 6
    any letter -> 6
final state 6:
    any number -> 6
    any letter -> 6
    "." -> 5

Can you write a description of the Animorphs machine? The Star Wars machine?

Okay, so we have a language now! Now, how do we turn the text we have above into a structured format? Well, we can see that there are basically two kinds of production in our language: productions that start a new state, and productions that add transitions to the current state. So in “pseudocode”, we want to do this:

until we run out of input:
    if the next thing looks like a new state:
        parse a "new state" production
        add the new "state structure" to our list of states
        make the new state our "current state", so we can add transitions to it
        parse a "transition" production
        if there's not a "current state" yet, complain
        add the new "transition structure" to the current state

Hey, that’s pretty good! Now we just have to do those other two pieces: parsing each kind of production.

to parse a "new state" production:
    if the next token is "start":
        remember that we're trying to make a start state
        move to the next token
    if the next token is "final":
        remember that we're trying to make a final state
        move to the next token
    if the next token is "state":
        we already know we're trying to make a new state
        just move to the next token
    the next token is the name of the state
    if it's not a valid name, complain
    the next token should be a colon
    if it's not a colon, complain
    construct a "state" with the given name and attributes (start? final? both?)
    return the new state to the parsing loop

to parse a "new transition" production:
    if the next token is "any":
        we're trying to match a set of things
        the next token tells us what set
        construct an "any-transition" based on that set
        we're trying to match specific text
        if the token isn't quoted text, complain
        construct a "plain-transition" based on that text
    the next two tokens should be "-" and ">" to make an arrow
    if that's not what we see, complain
    the next token should be the name of the state to transition to
    if it's not a valid name, complain
    record the destination name in the transition
    return the new transition to the parsing loop

Whew! Any questions? (I’ll answer ‘em down in the comments.)

There’s a bit more code this time, so I’m not going to put it here at all. Once again it’s available as a Gist on GitHub, along with a modified version of the lexer from last time.

(What changed in the lexer? It now recognizes numbers and the symbols “-”, “>”, and “:”, and has checks to see if something is a digit, space, or quote symbol in addition to “letter”. The basic loop also passes an extra nil when it finishes, to let the code using the lexer know that the input is done.)

You might notice there are a lot of things missing from our pseudocode—for instance, it doesn’t tell check that all the transitions go to real states, or that all the states have unique names. That’s not usually a parser’s job—like the famous “Colorless green ideas sleep furiously”, grammatical correctness does not necessarily imply meaning. We’ll tackle these issues next time, along with actually building and simulating a state machine based on one of these descriptions.

Postscript: “If this is where the tutorial was going all along, why check for parentheses and the comma in the lexer? They’re not actually used here!” Well, actually I was originally going to try to parse patterns (like those from the “What are Regular Expressions?” chapter). Then I realized that arbitrary pattern matching (including backtracking) is hard, and that I was writing some pretty hairy code to make it work. Although I think I got it working in the end, it wasn’t a great tutorial project.

  1. Except exponentiation. Why? Because (x^y)^z is the same as x^(y*z), so we don’t need another way to write it. Better to have x^y^z mean x^(y^z), which is usually a much bigger number. ↩︎