Planning it out

Okay, here's where the theory begins. Like I said I've planned this out in such a way that hopefully everyone should be able to follow even if they have little knowledge of compiler mechanics.

Our language, which we shall called FooLang for argument's sake, will work by scanning a line of text, terminated by a semi-colon, and executing it. The following will all be valid examples of FooLang lines:

a = 45;
b = a;
print a;
d = a + b + a;
c = "Foo bar baz";
print "Baz bar foo";

As you can see, any text outside of strings that isn't "print" is considered a variable. Thus, the first line assigns 45 to a, the second assigns a to b, etc. The four example is of particular interest as it is a complex statement - there are three operators in there, and each one needs to be handled individually.

Handling simple statements such as "a = 10;" is quite easy - match variable, match equals, and match number. However, allowing statements that can be of any arbitrary size, such as "a = 10 * 10 + 5 - 3 * 4 - 9 / 2 * 10" isn't just a matter of writing dozens of rules for each possibility. Instead, we need some recursion. Ignoring problems such as operator associativity, our long equation above can be thought of like this:

a = 10 * (10 + (50 - (3 * (4 - (9 / (2 * 10))))));

That is, we multiply the last two first, then take the result of that and use it as the divisor for 9, then take the result of that and subtract it from 4, etc. Essentially what is happening is that the last three items are being taken off the equation, executed, and the result is being put back onto the end of the equation. This is then repeated again and again like this:

a = 10 * (10 + (50 - (3 * (4 - (9 / (2 * 10))))));
a = 10 * (10 + (50 - (3 * (4 - (9 / 20)))));
a = 10 * (10 + (50 - (3 * (4 - 0.45))));
a = 10 * (10 + (50 - (3 * 3.55)));
a = 10 * (10 + (50 - 10.65));
a = 10 * (10 + 39.35);
a = 10 * 49.35;
a = 493.5

At the end we have our simple rule, VARIABLE EQUALS NUMBER, so all we have to do is write a handler for that. Previously we looked at using an array like a double-ended queue, which means adding and removing items to and from the front and back of the array. If we only add or remove items from the back, what we'd have is the "stack" data structure, and we can manipulate our array stack using array_push() to push an item onto the end of the stack and array_pop() to pop an item off the stack.

Now look at the first two lines of the expression break down again:

a = 10 * (10 + (50 - (3 * (4 - (9 / (2 * 10))))));
a = 10 * (10 + (50 - (3 * (4 - (9 / 20)))));

If we're to use a stack to store this expression, this would look something like this (actions are in bold ):

a = 10 * (10 + (50 - (3 * (4 - (9 / (2 * 10))))));
pop 10
a = 10 * (10 + (50 - (3 * (4 - (9 / (2 *)))));
pop *
a = 10 * (10 + (50 - (3 * (4 - (9 / (2)))));
pop 2
a = 10 * (10 + (50 - (3 * (4 - (9 /))));
multiply 2 * 10
push result
a = 10 * (10 + (50 - (3 * (4 - (9 / 20)))));

Thus, a stack works perfectly for our needs. However, there's one more thing to consider: given that code needs to be executed in precisely the same way it was written, how can you be sure you're executing things correctly? That is, in the example above you can see the last two operations will be 2*10 then 9/20, but, if we start at the end, how do we know where to pass the result of 2*10?

The answer is that we need to set up recursion of the calculations, and this is best done using Reverse Polish Notation. This is a fancy term meaning that operators are postfix, or appended to the two operands.

For example, 2*10 becomes 2 10 *. This actually eliminates any problems of ambiguity, because a given expression can only be evaluated in one way. Our original expression was this:

a = 10 * (10 + (50 - (3 * (4 - (9 / (2 * 10))))));

In postfix notation, this is equal to

a 10 10 50 3 4 9 2 10 * / - * - + * =

Yes, that might look complicated to a reader, but to a computer it makes perfect sense, and will be analysed like this:

a 10 10 50 3 4 9 2 10 * / - * - + * =
pop =
a 10 10 50 3 4 9 2 10 * / - * - + *
pop *
a 10 10 50 3 4 9 2 10 * / - * - +
pop + (then, to save space here, it will also pop -, *, -, and /)
a 10 10 50 3 4 9 2 10 *
pop *, 10, and 2
a 10 10 50 3 4 9
calculate 10 * 2, push result, push previous operator
a 10 10 50 3 4 9 20 /
pop /, 20, and 9
Calculate 9 / 20, push result, push previous operator
a 10 10 50 3 4 0.45 -
Etc...

Essentially the code would pop off an element, and, if it's an operator, it would pop off another, and another, etc, until it finds two operands. These operands are then calculating with the last operator, and the result is pushed back onto the stack along with the previous operator. This process is then repeated and repeated until the stack is empty.

Of course, rather than pushing data back onto the stack, it's better to use a recursive function call, but I'm getting ahead of myself - let's start at the very beginning: how to parse text into tokens

 

Want to learn PHP 7?

Hacking with PHP has been fully updated for PHP 7, and is now available as a downloadable PDF. Get over 1200 pages of hands-on PHP learning today!

If this was helpful, please take a moment to tell others about Hacking with PHP by tweeting about it!

Next chapter: How to parse text into tokens >>

Previous chapter: Output

Jump to:

 

Home: Table of Contents

Copyright ©2015 Paul Hudson. Follow me: @twostraws.