How to parse text into tokens

Although there are all sorts of fancy things you can do with regular expressions, the easiest option is just to have a function that reads in a character, eliminating all whitespace, and attempts to convert that character into a type it knows of. For example, if the character returned is a letter, the tokeniser should keep reading letters until it finds a non-letter symbol, then it should return the string to the parser.

To handle this, we'll create a function called gettoken(), which scans the input file looking for just one matching token. This must be called by a parent function that calls gettoken() repeatedly until a semi-colon is found, at which point we have a whole statement and should execute the line.

In order to create the lexer, we need some token types to match. From the example lines above, we need to have a number token type, a variable type, an assignment type, a print type, a string type, a semi-colon type, a plus type, and a multiply type. Rather than having strings to handle these, it's better to use numeric constants with easy-to-remember names, thus we'll be using these:

define("FOO_NUMBER", 0);
define("FOO_VARIABLE", 1);
define("FOO_ASSIGNEQUALS", 2);
define("FOO_PRINT", 3);
define("FOO_STRING", 4);
define("FOO_SEMICOLON", 5);
define("FOO_PLUS", 6);
define("FOO_MULTIPLY", 7);

The numbers assigned are basically irrelevant, as long as you're always careful to never give two constants the same value.
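If you want a safety net against accidentally reusing a value, a quick check at startup can catch it. This is only a sketch and not part of the original script: the $tokentypes list is maintained by hand, and the $all_distinct name is made up for illustration.

```php
<?php
define("FOO_NUMBER", 0);
define("FOO_VARIABLE", 1);
define("FOO_ASSIGNEQUALS", 2);
define("FOO_PRINT", 3);
define("FOO_STRING", 4);
define("FOO_SEMICOLON", 5);
define("FOO_PLUS", 6);
define("FOO_MULTIPLY", 7);

// hypothetical hand-maintained list of every token type constant
$tokentypes = array(
    FOO_NUMBER, FOO_VARIABLE, FOO_ASSIGNEQUALS, FOO_PRINT,
    FOO_STRING, FOO_SEMICOLON, FOO_PLUS, FOO_MULTIPLY
);

// array_unique() drops repeated values, so the counts only match
// when no two constants share a value
$all_distinct = count(array_unique($tokentypes)) === count($tokentypes);

if (!$all_distinct) {
    die("Fatal error: two token types share the same value\n");
}
```

The check costs almost nothing, but it only helps if new constants are also added to the list.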

We also need to know what we consider to be a character for variables. The minimum definition is usually A-Z, a-z, and "_", and we'll stick to that for now. Therefore we need a global definition of the $characters array, like this:

$characters = array_merge(range('a', 'z'), range('A', 'Z'));
$characters[] = "_";
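To see how this array gets used, here is a small sketch of a membership test. The is_variable_char() helper is hypothetical and exists only for this illustration; gettoken() itself calls in_array() directly.

```php
<?php
$characters = array_merge(range('a', 'z'), range('A', 'Z'));
$characters[] = "_";

// hypothetical helper: true if $c may appear in a variable name
function is_variable_char($c, $characters) {
    // the third argument asks in_array() for a strict (===) comparison
    return in_array($c, $characters, true);
}

var_dump(is_variable_char("q", $characters)); // bool(true)
var_dump(is_variable_char("_", $characters)); // bool(true)
var_dump(is_variable_char("7", $characters)); // bool(false)
```

Digits are deliberately left out, so a name like x1 would stop at the 1 under this definition.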

We also need a script to parse:

$script = fopen("", "r");

Finally, we also need a place to store variables, so put this line in there too:

$variables = array();

We will be storing the last value read in by gettoken() in the $lasttoken variable, which will be global across the whole script. With these in place we have enough to implement the gettoken() function - be prepared for a whole lot of code!

function gettoken() {
    global $script, $characters, $lasttoken;
    $c = 0;

    // skip whitespace
    while (($c = fgetc($script)) == ' ' || $c == "\t" || $c == "\n" || $c == "\r");

    // exit if EOF is reached
    if (feof($script)) exit;

    // match numbers
    if (is_numeric($c)) {
        $nextchar = fgetc($script);

        while(is_numeric($nextchar)) {
            $c .= $nextchar;
            $nextchar = fgetc($script);
        }

        // the last character read was not a number, put it back
        // (at EOF fgetc() returned false, so there is nothing to put back)
        if ($nextchar !== false) fseek($script, -1, SEEK_CUR);
        $lasttoken = $c;
        return FOO_NUMBER;
    }

    if ($c == "=") {
        return FOO_ASSIGNEQUALS;
    }

    if ($c == "+") {
        return FOO_PLUS;
    }

    if ($c == "*") {
        return FOO_MULTIPLY;
    }

    if ($c == ";") {
        return FOO_SEMICOLON;
    }

    if ($c == "\"") {
        $nextchar = fgetc($script);

        while($nextchar != "\"") {
            // a newline or the end of the file before the closing quote is an error
            if ($nextchar === false || $nextchar == "\n") {
                die("Fatal error: Unterminated string\n");
            }
            $c .= $nextchar;
            $nextchar = fgetc($script);
        }

        // note, we don't put the last character back here as it is the closing double-quote
        // trim off the double quote at the beginning, leaving the string contents untouched
        $lasttoken = ltrim($c, "\"");
        return FOO_STRING;
    }

    if (is_string($c)) {
        $nextchar = fgetc($script);
        while($nextchar != "\n" && in_array($nextchar, $characters)) {
            $c .= $nextchar;
            $nextchar = fgetc($script);
        }

        // last character was not a letter, put it back (unless we hit EOF)
        if ($nextchar !== false) fseek($script, -1, SEEK_CUR);
        $lasttoken = trim($c);

        // is this a print statement? If so, it's special
        switch($lasttoken) {
            case "print":
                return FOO_PRINT;
            default:
                return FOO_VARIABLE;
        }
    }
}

Yes, that's a lot of code, but it is not hard at all once broken up. The empty while loop is there to automatically ignore characters that are either \n (new line), \t (tab), " " (spaces), or \r (carriage return), leaving just the tokens we care about. I have added comments throughout to point out the tricky parts, but essentially the single character that was read in is compared against various options - is it numeric? Is it a semi-colon?
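If the empty while loop looks odd, it can help to run it on its own. The sketch below uses an in-memory stream (php://memory) in place of the script file, which is an assumption made purely so the example is self-contained.

```php
<?php
// an in-memory stream stands in for the script file in this demo
$script = fopen("php://memory", "w+");
fwrite($script, " \t\r\n  x = 1;");
rewind($script);

// the loop body is empty: all the work happens in the condition,
// which keeps reading while the character is whitespace
while (($c = fgetc($script)) == ' ' || $c == "\t" || $c == "\n" || $c == "\r");

// $c now holds the first non-whitespace character
echo $c, "\n"; // x
```

When the loop ends, the character that stopped it has already been consumed, which is why the rest of gettoken() starts by inspecting $c rather than reading again.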

When a match is found, things start heating up. For example, if the first character is a number (e.g. 1), the next four characters might be the rest of the number, e.g. 2345, giving 12345. In this situation we need to get the other characters to make up the full number, so a loop is used with an is_numeric() check - as long as each new character is a number, add it to the existing number. Otherwise, we've got something else in the next character (e.g. a semi-colon), so we need to step back in the file by one character so that it will be re-read on the next call, and then return the number for parsing.
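The read-ahead-then-step-back idea can be tried in isolation. This is a minimal sketch, again assuming an in-memory stream instead of a real script file; the $number and $reread names are made up for the demo.

```php
<?php
// an in-memory stream stands in for the script file in this demo
$script = fopen("php://memory", "w+");
fwrite($script, "12345;");
rewind($script);

$number = fgetc($script);   // "1"
$nextchar = fgetc($script); // "2"

// keep appending for as long as we see digits
while (is_numeric($nextchar)) {
    $number .= $nextchar;
    $nextchar = fgetc($script);
}

// $nextchar is now ";", which belongs to the next token,
// so move the file pointer back one byte
if ($nextchar !== false) fseek($script, -1, SEEK_CUR);

// the next read sees the semi-colon again
$reread = fgetc($script);

echo $number, "\n"; // 12345
echo $reread, "\n"; // ;
```

Without the fseek() call the semi-colon would be swallowed along with the number, and the statement would never appear to end.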

The $lasttoken value is used to store the value of the token found, whereas the actual function returns a token type. The $lasttoken variable is global, which means our parser will be able to access it directly as soon as the gettoken() function returns.

That's about it for tokenising, surprisingly enough - remember, the sole job of the function is to hunt through the file to find just one token and return it to the parser. Thus, gettoken() will be called many times in the script, only exiting when the feof() ("are we at the end of the file?") function call inside gettoken() returns true.
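To give a feel for how a caller might use this, here is a rough sketch of a loop that gathers tokens until it reaches a semi-colon. It is not the parser from this tutorial: gettoken_stub(), read_statement() and the $pending array are all hypothetical, and the stub hands out pre-made tokens for "x = 5;" so that the example runs without a script file.

```php
<?php
define("FOO_NUMBER", 0);
define("FOO_VARIABLE", 1);
define("FOO_ASSIGNEQUALS", 2);
define("FOO_SEMICOLON", 5);

// hypothetical stand-in for the input: "x = 5;" as (type, value) pairs
$pending = array(
    array(FOO_VARIABLE, "x"),
    array(FOO_ASSIGNEQUALS, "="),
    array(FOO_NUMBER, "5"),
    array(FOO_SEMICOLON, ";"),
);
$lasttoken = null;

// stub with the same contract as gettoken(): returns the token type
// and leaves the token's value in the global $lasttoken
function gettoken_stub() {
    global $pending, $lasttoken;
    list($type, $lasttoken) = array_shift($pending);
    return $type;
}

// collect (type, value) pairs until a semi-colon ends the statement;
// the null check stops the loop if the stub runs out of tokens
function read_statement() {
    global $lasttoken;
    $statement = array();
    while (($type = gettoken_stub()) !== null && $type != FOO_SEMICOLON) {
        $statement[] = array($type, $lasttoken);
    }
    return $statement;
}

$statement = read_statement();
echo count($statement), "\n"; // 3
```

The semi-colon itself is not stored, since its only job is to mark where the statement ends.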

The parser is a little more difficult, though, as it needs to push tokens onto our stack. This raises the question: what is a token?



Copyright ©2015 Paul Hudson. Follow me: @twostraws.