Parser state machines in C
The messy wall of text below talks about this piece of code.
Parsers are often written as finite state machines. An FSM’s next state is a function of its input and current state, which works pretty well for a parser (‘input’ is current character being read, ‘state’ could be things like, ‘reading a string’ or ‘in a block’ or so).
The nscript parser works as an FSM. In fact, because of the nature of nscript, there is very little ‘look ahead’ involved (this is especially because it’s stack based). So the parser also does the execution. Check out the following nscript ‘statement’:
var 1 ==
{
"one" print
}
{
"not one" print
}
ifelse
What the code does is pretty obvious. Let’s see how it’s parsed. The ‘nscript grammar’ does not define any special construct like ‘a b c ifelse’ unlike in other languages. All that happens here is, a boolean and two blocks are pushed onto stack, and ‘ifelse’ is called. ifelse reads the blocks and the boolean off the stack just like a normal function. So the parser just reads names and executes them. Names needn’t only be alphanumeric, which is why so-called ‘operators’ like ‘=’ work. No special grammar involved.
The part of the nscript code relevant to parsing can be found in definition of the ns_interpret function. The FSM structure is very apparent: It’s a pretty simple switch-case FSM, with an enumeration defining the possible states and an integer ‘mode’ that remembers the current one. Within each state, the current character is checked, and stuff is done. One of the simplest states here is MD_READSTR – if a valid string character (including escape sequences), add it to the buffer; if end of string, push a string object with the contents of the buffer onto the stack and go back to MD_NONE.
Thus, the code has the following basic form:
currCase = case1
for each character:
switch currCase:
case1:
if input char follows some conditions:
do stuff
else
do other stuff
currCase = case2 #jump to a different case
case2:
if input char follows some conditions:
do stuff
else
do other stuff
# etc.
Recently I’d come across (can’t find the link again, will update this post when I do) an interesting way to implement such state machines in C. This method uses a jump table that associates characters (ASCII values) with jump positions to compute the next state. Such tables can be written in a pretty neat way using GCC extensions for array syntax. This isn’t actually very different from the method described above, since switch statements are usually implemented as jump tables anyway, but what makes it interesting is the direct character → out_state map for each in_state and the ‘cool’ factor (I dunno about you, but I thought it was pretty cool :P ).
Here’s an example table that could be used for this:
void *state[] =
{
[0 ... 255] = &&default,
['0' ... '9'] = &&digit,
['A' ... 'Z'] = &&caps
};
Here, ‘default’, ‘digit’ and ‘caps’ are goto labels defined elsewhere. Now you can just do:
goto *state[c];
Which is the same as ‘goto default’, ‘goto digit;’ or ‘goto caps;’ depending on the value of c.
This little example I wrote (the same one linked to at the top) uses the above method.