How parsing works

As mentioned already, the second stage of analysis is syntax analysis, which means we need to verify that the tokens have come into us in the right order. This is actually fairly easy to do if you have the parser generator Bison to hand, but PHP's equivalent (phpJay) is not very well developed right now. So, instead we're going to call gettoken() repeatedly until we get a semi-colon, at which point we will execute the line.

Converting to Reverse Polish Notation (RPN) is quite straightforward, but it requires a little thinking. The easiest way to handle the conversion is to totally ignore operator precedence, so this is the route we will take for now - this will be added later. In order to convert to RPN, we need to have two arrays in place: one for the expression, and one for the operators. As the tokens are parsed, all the non-operator tokens are put directly into the expression array as they are sent in, and all the operators are put into the operator array. Once the end of the statement is reached, all the operators are put onto the end of the expression array in reverse order.

Here is how that looks in PHP:

function main() {
    GLOBAL $lasttoken;

    while(1) {
        $stack = array();
        $operators = array();

        do {
            $token = gettoken();

            switch($token) {
                case FOO_NUMBER:
                case FOO_VARIABLE:
                case FOO_STRING:
                    $stack[] = new token(IS_OPERAND, $token, $lasttoken);
                    if ($token != FOO_SEMICOLON) {
                        $operators[] = new token(IS_OPERATOR, $token, NULL);
        } while ($token != FOO_SEMICOLON);

        // move operators to stack
        while (count($operators)) array_push($stack, array_pop($operators));

        // execute line!

There are three important parts to that function: the call to gettoken(), the switch/case block, and the while(count($operators)) loop at the end.

The entire main() is one big while loop that keeps looping as long as "1", and as 1 will always be true, this is an infinite loop. No, it's not as bad as it seems - this loop will not carry on forever! If you recall our gettoken() function calls exit() if it encounters the end of the file, which means our script will be terminated from within gettoken(). So, as far as we're concerned, main() should carry on looping until told to exit by gettoken().

The call to our gettoken() function is important as it is inside this loop. This means that main() will keep requesting tokens until the script ends, which is just what we want it to do. As you know, gettoken() returns a constant such as FOO_NUMBER to tell our main() function (the parser) what it picked up. It also sets the $lasttoken variable, which was why we declared it as global at the start of the function.

Once we know what token was read, we go into a switch/case block to decide where the token should be added to. If the token was a number, string, or variable, it's an operand, so we create a new token object, passing in IS_OPERAND, the token type, and its actual value, and add the new token to our $stack array. If it's anything else, we need to examine it more closely, which is where the "default" block comes in. If the token is a semi-colon, we don't want to have it - the loop terminates when $token is FOO_SEMICOLON, so just by ignoring the value we terminate the loop and execute the line. If the token is anything else, however, we create a new token and add it to the $operators array.

The third important part is executed once we have our complete $stack and $operator arrays. At this point we need to pop the operators from $operator and push them onto $stack. This needs to be done in reverse order, which means we take the last operator from $operators, and add it to the end of $stack, and repeat until $operators is empty. This is all accomplished in one line: array_push($stack, array_pop($operators)). So the whole loop translates to, "while there are still operators, add to the end of the stack the end of the operators".

At this point we have our full RPN stack, and need to execute it. This is done through another function, cleverly called execute(), which takes a reference to the stack to execute.


Next chapter: Finally, execution >>

Previous chapter: What is a token?

Jump to:


Home: Table of Contents

Copyright ©2015 Paul Hudson. Follow me: @twostraws.