The analysis stage can be broken down into three sub-parts: lexical analysis, syntax analysis, and semantic analysis. The first stage reads the input script character by character to find tokens - parts of the script that are logical units by themselves. In PHP terms, a string is a token, as is a variable, a number, a brace, etc. As the lexical analyser (usually just called the lexer) finds tokens within a script, it returns them to the syntax analyser (parser), which makes sure that the script fits the syntax rules you have laid down. Finally, once the script has been tokenised and parsed, the semantic analyser kicks in to make sure the meaning of the script is valid.

Make sense? Right, onto the output stage...

Just kidding. You probably want some examples to clarify the above, and I wouldn't blame you!

The lexer scans through the input script looking for tokens to match. Consider the following line of pseudocode:

i = 10 * 10;

A lexer for that language would read it through and return something like the following:

VARIABLE ASSIGN_EQUALS NUMBER MULTIPLY NUMBER SEMICOLON

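If you would like to see that in action, here is a minimal sketch of a lexer for that one line of pseudocode, written in PHP. The lex() function, the regular expressions, and the token names NUMBER, MULTIPLY and SEMICOLON are assumptions made for illustration; a real lexer would recognise far more than this.

```php
<?php
// Toy lexer for the pseudocode line above. This is an illustrative
// sketch, not how PHP's own lexer is written.
function lex(string $source): array
{
    // Token type => pattern. The "A" modifier anchors each pattern at
    // the offset we pass to preg_match(), so nothing gets skipped.
    $patterns = [
        'WHITESPACE'    => '/\s+/A',
        'NUMBER'        => '/[0-9]+/A',
        'VARIABLE'      => '/[A-Za-z_][A-Za-z0-9_]*/A',
        'ASSIGN_EQUALS' => '/=/A',
        'MULTIPLY'      => '/\*/A',
        'SEMICOLON'     => '/;/A',
    ];

    $tokens = [];
    $offset = 0;
    $length = strlen($source);

    while ($offset < $length) {
        $matched = false;
        foreach ($patterns as $type => $regex) {
            if (preg_match($regex, $source, $m, 0, $offset) === 1) {
                // Whitespace separates tokens but is not one itself.
                if ($type !== 'WHITESPACE') {
                    $tokens[] = ['type' => $type, 'value' => $m[0]];
                }
                $offset += strlen($m[0]);
                $matched = true;
                break;
            }
        }
        if (!$matched) {
            throw new RuntimeException(
                "Unexpected character '{$source[$offset]}' at offset $offset"
            );
        }
    }

    return $tokens;
}

foreach (lex('i = 10 * 10;') as $token) {
    echo $token['type'], ' ', $token['value'], PHP_EOL;
}
// VARIABLE i
// ASSIGN_EQUALS =
// NUMBER 10
// MULTIPLY *
// NUMBER 10
// SEMICOLON ;
```

Each token carries both its type and the text that matched, which is what lets the later stages get at the real values.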
The actual values i and 10 would also be returned for later use, along with the token types found. The parser has a long list of these tokens, and knows the order in which they are legally allowed to appear. For example, a parser might have the following two rules:

statement = VARIABLE ASSIGN_EQUALS expression

expression = NUMBER MULTIPLY NUMBER

Here you can see that the parser would know that an expression is a number multiplied by another number. Naturally this is a gross over-simplification, as there is a lot you can do other than just multiply two numbers together! However, notice also that the parser knows that a statement is a VARIABLE token followed by an ASSIGN_EQUALS token, followed by an expression. What happens if the lexer returns VARIABLE ASSIGN_EQUALS VARIABLE? Well, in the example above the parser would error out because it doesn't know how to handle that eventuality.
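To make that concrete, here is a rough sketch of a parser that knows only those two rules: a statement is VARIABLE ASSIGN_EQUALS expression, and an expression is one number multiplied by another. The function names and the NUMBER, MULTIPLY and SEMICOLON token names are assumptions for illustration. The rules as written say nothing about a statement terminator, so this sketch simply tolerates an optional trailing SEMICOLON.

```php
<?php
// Toy parser working on token types only. Hypothetical helper names.

// Check that the token at $pos is the one we expect, then move on.
function expect(array $types, int $pos, string $expected): int
{
    $found = $types[$pos] ?? 'end of input';
    if ($found !== $expected) {
        throw new RuntimeException(
            "Parse error: unexpected $found, expecting $expected"
        );
    }
    return $pos + 1;
}

// expression = NUMBER MULTIPLY NUMBER
function parseExpression(array $types, int $pos): int
{
    foreach (['NUMBER', 'MULTIPLY', 'NUMBER'] as $expected) {
        $pos = expect($types, $pos, $expected);
    }
    return $pos;
}

// statement = VARIABLE ASSIGN_EQUALS expression
function parseStatement(array $types): bool
{
    $pos = expect($types, 0, 'VARIABLE');
    $pos = expect($types, $pos, 'ASSIGN_EQUALS');
    $pos = parseExpression($types, $pos);

    // Tolerate one optional statement terminator.
    if (($types[$pos] ?? null) === 'SEMICOLON') {
        $pos++;
    }
    if ($pos !== count($types)) {
        throw new RuntimeException("Parse error: unexpected {$types[$pos]}");
    }
    return true;
}

var_dump(parseStatement(['VARIABLE', 'ASSIGN_EQUALS', 'NUMBER', 'MULTIPLY', 'NUMBER']));
// bool(true)

try {
    parseStatement(['VARIABLE', 'ASSIGN_EQUALS', 'VARIABLE']);
} catch (RuntimeException $e) {
    echo $e->getMessage(), PHP_EOL;
    // Parse error: unexpected VARIABLE, expecting NUMBER
}
```

The second call fails for exactly the reason described above: no rule allows a VARIABLE where an expression is expected.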

Think in terms of PHP for a moment. How often have you seen an error message like this one?

PHP Parse error: parse error, unexpected '='

That's PHP's parser in action, saying that it doesn't have a rule to match the = symbol in the current context.
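You can watch PHP's parser reject code without crashing your script. The sketch below assumes PHP 7 or later, where eval() throws a ParseError for code that breaks the syntax rules; parseErrorFor() is a hypothetical helper name. The exact wording of the message differs between PHP versions, so the example makes no claim about what it says.

```php
<?php
// Returns the parser's complaint about a snippet of PHP code,
// or null if the snippet parses cleanly. Assumes PHP 7+.
function parseErrorFor(string $code)
{
    try {
        // The whole string is parsed before anything runs, and the
        // leading "return;" stops the snippet itself from executing.
        eval('return; ' . $code);
        return null;
    } catch (ParseError $e) {
        return $e->getMessage();
    }
}

// Single quotes keep PHP from interpolating $i into the string.
var_dump(parseErrorFor('$i = 10 * 10;')); // NULL
echo parseErrorFor('$i = = 10;'), PHP_EOL; // a message complaining about "="
```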

Moving on, semantic analysis is there to decide whether the code is meaningful. For example, the following is syntactically correct in PHP, but semantically incorrect:

$foo = in_array($myvalue, $myarray, 1, 2, 3, 4, 5, 6, 7, 8);

In that example we pass in a value and an array to in_array(), but then eight other parameters. The whole line is syntactically correct - the function call is perfectly formed, with all the commas and brackets in the right place - but semantically incorrect as in_array() only takes three parameters. Thus the code is essentially meaningless and a warning must be issued, like this:

PHP Warning: Wrong parameter count for in_array()
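A semantic check of that kind can be sketched in a few lines. The example below uses PHP's ReflectionFunction class to look up how many parameters a function accepts; checkArgumentCount() and its message format are hypothetical, and this is not how PHP itself implements the check. Bear in mind that PHP only complains about surplus arguments for built-in functions, so the sketch is best read with those in mind.

```php
<?php
// One semantic check: does a call pass a legal number of arguments?
// Returns an error message, or null when the count is acceptable.
function checkArgumentCount(string $function, int $given)
{
    $info = new ReflectionFunction($function);
    $min  = $info->getNumberOfRequiredParameters();
    $max  = $info->getNumberOfParameters();

    if ($given < $min) {
        return "$function() expects at least $min arguments, $given given";
    }
    // Variadic functions accept any number of extra arguments.
    if (!$info->isVariadic() && $given > $max) {
        return "$function() expects at most $max arguments, $given given";
    }
    return null;
}

var_dump(checkArgumentCount('in_array', 3));        // NULL
echo checkArgumentCount('in_array', 10), PHP_EOL;
// in_array() expects at most 3 arguments, 10 given
```

Notice that the check needs no knowledge of syntax at all: by this point the parser has already confirmed the call is well formed, and all that is left is to decide whether it means anything.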

So, the three stages of analysis: check we have matching tokens, check the tokens fit together into valid groups, and check the overall code is actually meaningful.



Copyright ©2015 Paul Hudson.