What are Regular Expressions?

Welcome back to the NaCreSoMo 2014 CS mini-tutorial. Yesterday we talked about state machines; today we’re going to put them to use. The examples you’ll see here are a little more abstract, but still no code yet. Lots of diagrams, though!

(And we’re not actually going to talk very much about the term “regular expression” until the end.)

  1. What is a Lexer?
  2. What are State Machines?

State machines are all over CS, but a particularly interesting kind has to do with seeing if a particular bit of text (a string) matches a particular pattern. For instance, American phone numbers always have three digits, three more digits, and then four more digits, so if you see something like “626-376-9503” (and you’re in America), it’s probably a phone number. Similarly, “3/20/2014” is probably a date, and “The (Something)” is probably an Animorphs book.

Let’s look at the last example: we’re trying to match two-word phrases starting with “The”. In that case, the state machine that handles this might look something like this:

The start state has one transition labeled "The", which leads to a second state. The second state has a transition labeled "(any word)", which leads to a final state. The final state has no transitions.

This state machine has two things we haven’t seen before: a final state and an error state. The final state is very useful because it answers the question for us: if we follow all the directions in our input and end up in the final state, it matches! If we don’t end up in the final state, it doesn’t match.

The error state tells us what to do if we get a direction we don’t expect: transition to this state, and stay there forever. (Notice how it has a transition that loops back to itself—no matter what input is left, you’re never going to leave the error state.) This is how you represent a string that doesn’t match. All the transitions to the error state always have the form “anything else”, though, so computer scientists don’t usually bother to draw it in. Therefore, if your input doesn’t match any of the transitions at your current state, the machine can just stop; the input doesn’t match the pattern.

What if you run out of directions, but you haven’t reached the final state or the error state? In that case, your input matches a prefix of the pattern, but not the whole thing. That’s still considered to be “no match”.

Let’s see how this machine does on some test input.

Input Result Matches?
“The Invasion” 3 Yes!
“The Ultimate” 3 Yes!
“The Andalite’s Gift” X No
“Visser” X No
“The iPhone” 3 Yes!
“The Final Countdown” X No
“The” 2 No
”” 1 No

It looks like the machine is behaving as expected. No, it’s not telling us which titles actually belong to Animorphs books, but it does tell us which ones fit the pattern.


Here’s another sci-fi-themed state machine; this one tells you if a title belongs to one of the six currently-released Star Wars movies. (Per common practice I’ve left out the error state this time.)

The start state has six transitions, each with the first word of each Star Wars movie. All of these states continue with the second, third, and possibly fourth words, leading to six separate final states

This introduces another new concept: multiple final states. In this case, it’s okay if we end up in any of the final states.

You might also notice that the very first “The” transition out of the start state is ambiguous: do we go to state 2 or state 16? The answer is: both! We don’t know whether the input is going to be “The Phantom Menace”, “The Empire Strikes Back”, or “The Wrath of Khan” when all we’ve seen is “The”. So we have to try both choices, and hopefully the next word will give us a clue which branch we should be following.

(Another way to think about this is that we can try one of the branches first. If it ends up in a final state, we’re good! If not, we rewind and try the other branch. This is called backtracking.)

Here’s another machine that matches the exact same input, but uses a slightly different pattern to do it:

This is the same as the diagram above, except that there is only one transition labeled "The", which leads to a state with two transitions: "Phantom" (for "The Phantom Menace"), and "Empire" (for "The Empire Strikes Back"). These each continue to the rest of their respective titles.

State machines like this, where there aren’t duplicate transition labels, are called “deterministic”, because you know exactly where you are going at each state. The opposite, where you have multiple possibilities, is called “non-deterministic”. So the first Star Wars machine is non-deterministic, and this second one is deterministic.

Extra credit: Prove that any non-deterministic state machine can be turned into a deterministic one that matches the exact same input possibilities.

Let’s see how these machines do with some test input. This time I’ll write the whole path leading up to the end, so you can “see it working”.

Input Path Matches?
“A New Hope” 1, 13, 14, 15 Yes!
“The Phantom Menace” 1, 2/16, 3, 4 Yes!
“The Empire Strikes Back” 1, 2/16, 17, 18, 19 Yes!
“Star Trek Into Darkness” 1, X No
“Attack on Titan” 1, 5, X No
“The Wrath of Khan” 1, 2/16, X No
“Revenge of the” 1, 9, 10, 11 No
”” 1 No

Here’s one more variant of the Star Wars machine that looks like it might reduce some redundancy. What’s wrong with it?

This is the same as the diagram above, but "Attack", "Revenge", and "Return" all lead to the same state. The state has one transition, "of", which leads to a state with one transition, "to". This latest state has three transitions, "Clones", "Sith", and "Jedi", all of which lead to final states.


Okay, enough with the sci-fi. Let’s turn to the problem I promised yesterday: matching e-mail addresses.

…well, gosh, would you believe all of these are (in theory) valid e-mail addresses?

username@example.com
username123_456@example.com
username+piece@example.com
username%something-else@example.com
user.name@example.com
!!!???@example.com
^_~@example.com
"this is actually my e-mail"@example.com
username(ignored)@example.com
(ignored)username@example.com

username@yahoo.co.uk
username@international.テスト
username@a.very.strange.subdomain.of.example.com
username@example.com.
username@example.com(ignored again)
username@[127.0.0.1]

*blink*

Okay, let’s start over. Rather than try to exactly match RFC 5322, let’s just go with how you’d describe a “normal” e-mail address to another human.

“Some letters and numbers and maybe symbols, then an ‘@’, then a domain name…which is some more letters and numbers separated by dots, with at least one dot.”

That rules out a few of the weird cases above, but that’s okay. Most mail hosts won’t let you use an e-mail that unusual anyway.

Let’s see if we can build a machine to match the pattern we described above. (This time, each transition will represent one letter, instead of one word.) Here we go, beginning with the start state…

Start state 1 only.

An e-mail address always has a “user name” part, so we can’t start off with an “@”. So the only valid input right now is a number, a letter, or a symbol that isn’t “@”.

Add transitions from start state 1 to a new state 2 for "any letter", "any number", or "any symbol besides '@'".

Now we have our first choice. If we see an “@” sign, we move foward to the second half of the e-mail.

Add a transition from state 2 to a new state 3 for "'@'".

But if we see something else, it’s just part of the user name, and we should keep looking for the “@”. Nothing’s really changed. Therefore, we just want to stay in the same state.

Add transitions from state 2 back to itself for "any letter", "any number", or "any symbol besides '@'".

After we’ve seen the “@”, we’re looking for a domain name. That has at least two parts, so we start off looking for a number or letter.

Add transitions from state 3 to a new state 4 for "any letter" and "any number"

Like before, if we see another letter or number, nothing really changes.

Add transitions from state 4 back to itself for "any letter" and "any number"

If we see a dot, though, we’ve made progress: we’ve seen at least one part of the domain name.

Add a transition from state 4 to a new state 5 for "'.'"

And now, if we see at least one number or letter, we’re done!

Add transitions from state 5 to a new final state 6 for "any letter" and "any number"

If we see any more letters or numbers, that’s fine.

Add transitions from final state 6 back to itself for "any letter" and "any number"

If we see another dot, though, we have to go back to look for more letters.1

Add a transition from final state 6 back to state 5 for "'.'"

Whew!

Okay, let’s test it out!

Input Path Matches?
a@b.c 1, 2, 3, 4, 5, 6 Yes!
ab@cd.ef 1, 2, 2, 3, 4, 4, 5, 6, 6 Yes!
ab@c 1, 2, 2, 3, 4 No
ab@c.e. 1, 2, 2, 3, 4, 5, 6, 5 No
ab@@c 1, 2, 2, 3, X No
ab@c@d 1, 2, 2, 3, 4, X No
ab.c 1, 2, 2, 2, 2 No
@c.d 1, X No
ab@c.e.g 1, 2, 2, 3, 4, 5, 6, 5, 6 Yes!

We did it!

Exercise: Come up with a similar matcher for URLs. “Several letters and numbers, followed by ‘://’, followed by a domain name (like in the e-mail address). After that, there may or may not be a slash (‘/’), followed by more letters, numbers, and symbols.


So, what are “regular expressions”? They’re just a particular compact way of writing the “patterns” we’ve been using all along here. (The name comes from a math / CS theory concept called “regular languages”, which are roughly “‘languages’ made of text that matches a set of patterns like these”.)

By definition, the patterns that define a regular language have to be built up using only these three rules:

  • Concatenation: “A followed by B”
  • Union: “A or B”
  • Repetition: “A, zero or more times”

From these rules, you can actually build up some pretty complex languages—all of the state machines we worked with today can be described using just concatenation, unions, and repetition. I’m not going to teach you the syntax of regular expressions here, but RegexOne has a great interactive tutorial if you’re interested in learning.

Next time: back to the code, and actually using our lexer for something.

The diagrams in this post were created using Keynote.

  1. In reality, a final dot in a domain name is actually legal: it means “this really is a top-level domain, not implicitly a subdomain of something else”. But the state machine is more interesting this way! ↩︎