What is a Lexer?

Today’s NaCreSoMo creation is what’s known as a lexer. A lexer is a thing in Computer Science that takes input text and chops it up into words (often called tokens). In a language like English, the words are pretty much just separated by spaces…though you might want to separate out some punctuation as well if you’re trying to get a computer to understand the sentence.

If this is all too technical for you, take a look at a post from last year’s NaCreSoMo instead: “Interpreting Information“.

This lexer is written in Ruby, and is hopefully simple enough for you to follow along even without that much Ruby experience, or even much programming experience. I’ve put comments in the code (after the # marks) to help you out.

## Return true if l is a single letter (or underscore), false if not.
def letter?(l)
  case l
  when 'a'..'z', 'A'..'Z', '_'
    return true
  else
    return false
  end
end

## Splits the input into tokens and yields them to the given block.
def lex(input)
  # Delete extra spaces from the beginning and end.
  input.strip!
  until input.empty?
    tok = nil # See sanity check below.

    case input[0]
    when 'a'..'z', 'A'..'Z', '_'
      # Split on word boundaries.
      i = 0
      i += 1 while letter?(input[i])
      tok = input.slice!(0...i)

    when '"', "'"
      # Look for the same kind of quote later in the input.
      # A more complicated lexer would give you a way to put double quotes
      # inside other double quotes, usually by putting a backslash first: \".
      # This one doesn't bother.
      end_quote_idx = input.index(input[0], 1)
      raise "missing ending quote" if end_quote_idx.nil?
      tok = input.slice!(0..end_quote_idx)

    when '(', ')', ','
      # Just take the single character.
      tok = input.slice!(0)

    else
      raise "unexpected character: #{input[0]}"
    end

    raise "failed to account for a case" if tok.nil?
    yield tok

    # We ate a word. The next thing is probably one or more spaces. Delete them!
    input.lstrip!
  end
end

Essentially, the lexer looks at the first letter in the input and decides what to do based on that. If it’s a real letter1, it keeps reading until it’s seen the entire word, then saves that word as a token. If it’s a quote (double or single), it skips ahead until it sees the same kind of quote, and treats everything between the quotes as a single token, along with the quotes themselves. If it’s punctuation, the token is just the single character. And if it’s anything else (like, say, a number), it complains.

I wrote a lexer because lexers are a major part of compilers, the tools that take human-written source code and generate instructions for the computer to execute.2 The lexer is the very first step in compilation: taking a source code file and chopping it up into words so that the rest of the compiler can start trying to understand it. (You wouldn’t try to understand a sentence by reading it letter by letter, right?)

And compilers are what I work on for my day job, specifically the Clang compiler for the C, C++, and Objective-C languages. My job, for the most part, is about teaching the compiler to look for suspicious things in the source code and tell the programmer that they probably screwed up—essentially, getting the compiler to understand what the code means, rather than just what it does. I love it.

Getting back to this Creation, though…I didn’t just do this for itself. We’re going to do a bit more with this project later in the month. (Probably not tomorrow, though.)


If you want to play with this, it’s available as a Gist on Github in self-contained form. You’ll need to know how to run Ruby programs to use it, but once you have that you should be able to say ruby lexer.rb and then type in words, punctuation, and “strings” between single or double quotes. The program will then print out the tokens it finds, one per line.

If you’re interested in learning Ruby, there’s an easy, interactive tutorial at tryruby.org.

  1. Underscores are traditionally letters for programmers, in case you want to put_several_words_together_without_spaces. You can also jamWordsTogetherWithCapitalLetters. The latter is often called “camel case” and the former “snake case”. ↩︎

  2. Although…I was talking to a friend and realized that my real definition of “compiler” is a lot more general. To me, a compiler is a translator: it takes input, usually but not always human-readable, and converts it to another form, which is usually but not always lower-level. As an example of a compiler that goes the “other” way, there is a tool called Emscripten which allows you to compile a “low-level” programming language like C (i.e. a language that your operating system is written in) and end up with JavaScript (the language that runs in web browsers). ↩︎