Guru regexes

To become a true regex monk, you must understand the final five patterns, all of which are easy. However, you must also understand the ways of the third parameter to preg_match() - we will come onto that soon.

While there are many other patterns for use in regular expressions, they generally aren't very common. So far we've looked at all but five of the most common ones, which leaves us with . (a full stop), \s, \S, \b, and \B.

The pattern . will match any single character except \n (new line). Therefore, c.t will match "cat", but not "cart".
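As a quick check of the rule above, here is a minimal sketch that runs c.t against three inputs. The variable names and the "c\nt" test string are made up for illustration; only "cat" and "cart" come from the text.

```php
<?php
    // 1: the single character "a" sits between "c" and "t"
    $hasCat = preg_match("/c.t/", "cat");

    // 0: "cart" has two characters between "c" and "t", and . matches exactly one
    $hasCart = preg_match("/c.t/", "cart");

    // 0: the only character between "c" and "t" is a new line, which . will not match
    $hasNewline = preg_match("/c.t/", "c\nt");

    var_dump($hasCat, $hasCart, $hasNewline);
```

Because preg_match() returns 1 for a match and 0 for no match, the three values printed are 1, 0, and 0.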

The next two, \s and \S, equate to "match any whitespace" and "match any non-whitespace" respectively. Because every character is either one or the other, [\s\S] will match any one character, regardless of what it is (new lines included), and [\s\S]* will match anything at all, even an empty string.
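The difference between . and [\s\S] only shows up when a new line is involved. The sketch below uses a made-up two-line string ($text and the result variables are names chosen for this example) to show one pattern failing where the other succeeds.

```php
<?php
    // hypothetical input: two words separated by a new line
    $text = "one\ntwo";

    // 0: . refuses to match the new line between "one" and "two"
    $withDot = preg_match("/one.two/", $text);

    // 1: a new line is whitespace, so [\s\S] accepts it like any other character
    $withClass = preg_match("/one[\s\S]two/", $text);

    var_dump($withDot, $withClass);
```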

The last two patterns, \b and \B, equate to "on a word boundary" and "not on a word boundary" respectively. That is, if you use the regex /oo\b/ it will match "foo", "moo", "boo", and "zoo" because the "oo" is at the end of the word, but not "fool", "wool", or "pool" because the "oo" is inside the word. The \B pattern is the opposite, which means it would match only patterns that aren't on the edges of a word - using the previous example, "fool", "wool", and "pool" would be matched, whereas "foo", "moo", "boo", and "zoo" would not.

Here are some examples:

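<?php
    $string = "Foolish child!";
    if (preg_match("/oo\b/i", $string)) {
        // we will not get here: the "oo" is followed by "l", so it is not on a word boundary
    }

    // returns 1: the "oo" in "Foolish" is inside the word
    preg_match("/oo\B/i", $string);

    // returns 1: "Foolish" (seven non-whitespace), one space, "child!" (six non-whitespace)
    preg_match("/[\S]{7}[\s]{1}[\S]{6}/", $string);
?>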

The last preg_match() matches precisely seven non-whitespace characters, followed by one whitespace character, followed by six non-whitespace characters - the exact string.
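To see \b and \B split the word list from earlier, the following sketch loops over those seven words and sorts them by which pattern matches. The array and variable names are chosen for this example only.

```php
<?php
    $words = ["foo", "moo", "boo", "zoo", "fool", "wool", "pool"];
    $atEnd = [];   // words where "oo" sits on a word boundary
    $inside = [];  // words where "oo" is followed by another word character

    foreach ($words as $word) {
        if (preg_match("/oo\b/", $word)) {
            $atEnd[] = $word;
        }
        if (preg_match("/oo\B/", $word)) {
            $inside[] = $word;
        }
    }

    echo implode(", ", $atEnd), "\n";   // foo, moo, boo, zoo
    echo implode(", ", $inside), "\n";  // fool, wool, pool
```

No word lands in both lists, because after any given "oo" there is either a word boundary or there is not.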

That has brought us to the end of the list of regular expressions for pattern matching. However, before we move on to using regular expressions for other things, it is important you understand how the third parameter to preg_match() works. Learning this is probably more advanced than most users will need, so we're only going to touch on it before we move on to greener pastures.

The third parameter for preg_match() allows you to pass in an array for it to store a list of matched strings. Consider this script:

<?php
    $a = "Foo moo boo tool foo!";
    preg_match("/[A-Za-z]oo\b/i", $a, $myarray);
?>

The regex there translates to "match any uppercase or lowercase letter followed by "oo" at the end of a word, case insensitive". After running, preg_match() will place the text it matched in the string $a into $myarray, which you can then read for your own uses.

Now, if you remember, preg_match() returns as soon as it finds its first match, because most of the time we only want to know whether a string exists as opposed to how often it exists. As a result, our third parameter is not working as hoped quite yet, because $myarray only ever holds that first match - we need another function, preg_match_all(), to get this right.
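The sketch below reruns the script with var_dump() calls added, so you can see that only the first match is stored. The $found variable is a name introduced here for illustration.

```php
<?php
    $a = "Foo moo boo tool foo!";

    // preg_match() stops at the first match, so it can only ever return 0 or 1
    $found = preg_match("/[A-Za-z]oo\b/i", $a, $myarray);

    var_dump($found);    // int(1)
    var_dump($myarray);  // a one-element array holding "Foo"
```

Even though "moo", "boo", and "foo" would also match, none of them make it into $myarray.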

The preg_match_all() function works just like preg_match() - it takes the same parameters (apart from in very complicated cases you are unlikely to encounter), although its return value is the number of matches it found rather than just 0 or 1. Thus, apart from the function name, the same code works fine with the new function:

<?php
    $a = "Foo moo boo tool foo!";
    preg_match_all("/[A-Za-z]oo\b/i", $a, $myarray);
    var_dump($myarray);
?>

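This time $myarray is populated properly - but what does it contain? Many regex writers write complicated expressions to match various parts of a given string in one line, so $myarray will contain an array of arrays, with each array element containing a list of the strings that preg_match_all() found. For our regex there is just one inner array, $myarray[0], holding the four matches "Foo", "moo", "boo", and "foo"; "tool" is skipped because its "oo" is not at the end of the word.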

The final statement in the script calls var_dump() on the array so you can see the matches preg_match_all() picked up. The var_dump() function simply outputs the contents of the variable(s) passed to it for closer inspection, and is particularly useful with arrays and objects. You can read more on var_dump() later on.
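To make the "array of arrays" idea more concrete, here is a sketch that adds one pair of parentheses to the regex. The parentheses are an addition for this example, assuming you are comfortable with them as a capturing group; $count is likewise a name introduced here. With the default ordering, $myarray[0] holds the full matches and $myarray[1] holds whatever the first set of parentheses captured.

```php
<?php
    $a = "Foo moo boo tool foo!";

    // the parentheses capture the single letter that comes before "oo"
    $count = preg_match_all("/([A-Za-z])oo\b/i", $a, $myarray);

    var_dump($count);       // int(4)
    var_dump($myarray[0]);  // the full matches: "Foo", "moo", "boo", "foo"
    var_dump($myarray[1]);  // the captured letters: "F", "m", "b", "f"
```

Each extra set of parentheses in the regex would add another inner array to $myarray.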

That is all for standard regexes now - let's take a look at using regular expressions to alter strings, and how it can be better than str_replace().

Author's note: As you've seen, regular expressions are a text-based representation of how you'd like a string parsed. As they are text-based, they need to be parsed by PHP before they can be executed, which isn't a fast process. Fortunately, the PCRE library that performs these regexes automatically caches the compiled versions of the regexes for maximum performance. This cache is cleaned periodically to ensure that it doesn't hog memory.

 


Copyright ©2015 Paul Hudson. Follow me: @twostraws.