Advanced regexes

If you are confused already, it is probably best that you re-read the last section before continuing - the expressions only get more complicated!

We have gone through basic and novice regexes - now we're onto the powerful stuff. Outside of sets, the characters +, *, ?, { }, $, and ^ all have special meaning in a regex.

The first four control how many times the previous expression may repeat, and the last two control where in the string the match must occur. + means "match one or more of the previous expression", * means "match zero or more of the previous expression", and ? means "match 0 or 1 of the previous expression".

Here are some examples:

    preg_match("/[A-Za-z]*/", $string);
    preg_match("/-?[0-9]+/", $string);
    preg_match('/\$[A-Za-z_][A-Za-z_0-9]*/', $string); // single quotes, so the backslash reaches the regex engine

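The first expression will match "", "a", "aaaa", and any other run of uppercase and lowercase letters - the expression can be translated as "match zero or more uppercase and lowercase letters". Because zero letters is acceptable and the pattern is not tied to the start or end of the string, preg_match() will in fact succeed on any string at all: given "The sun has got his hat on", it matches "The" and stops at the first space. The second regex will match 1, 100, 324343995, and also -1, -100, -234011, etc - the "-?" means "match exactly 0 or 1 minus symbols".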
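If you want to check this for yourself, you can ask preg_match() to hand back the text it matched by passing a third argument. This is only a rough sketch: the sample strings and variable names are made up for illustration.

```php
<?php
// Sample inputs invented for this example.
$samples = ["", "aaaa", "The sun has got his hat on", "12345"];

$matched = [];
foreach ($samples as $string) {
    // preg_match() returns 1 on a match and 0 otherwise; $m[0] holds the matched text.
    $found = preg_match("/[A-Za-z]*/", $string, $m);
    $matched[] = $m[0];
    echo var_export($string, true), " => ", $found, ", matched ", var_export($m[0], true), "\n";
}

// The integer pattern picks out the first run of digits, with an optional leading minus.
preg_match("/-?[0-9]+/", "Temperature: -15 degrees", $m);
$signed = $m[0];
echo $signed, "\n";
```

Every sample reports a match, including "12345", where the matched text is simply empty: zero letters at the very start of the string is enough to satisfy the * quantifier.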

The last regex is fairly complicated, but, as always with regexes, complexity == power. As mentioned before, $ is a regex symbol in its own right; however, here we precede it with a backslash, which, unsurprisingly, works as an escape character, turning the $ into a standard character and not a regex symbol. We then match precisely one character from the range A-Z, a-z, and _, then match zero or more characters from the range A-Z, a-z, underscore, and 0-9. What kind of text would that match? Here are some examples: $A, $B, $C, $foo, $bar, $Test99, $_MyTest, $__Foo__. Look familiar? That's right - that regex will match PHP variables.
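To see the variable-matching regex at work, the sketch below pulls every variable name out of a line of source text using preg_match_all(). The $source text is invented for illustration. One thing to watch: the pattern is written in single quotes, because inside a double-quoted PHP string \$ collapses to a plain $ before the regex engine ever sees it, and the pattern would then begin with an end-of-line anchor rather than a literal dollar sign.

```php
<?php
// Invented sample text containing a few PHP-style variable names.
$source = 'echo $foo + $_MyTest * $Test99; // 100$ is not a variable';

// Single quotes keep the backslash, so the regex engine sees \$ (a literal dollar sign).
$pattern = '/\$[A-Za-z_][A-Za-z_0-9]*/';

// preg_match_all() collects every match; $matches[0] holds the full matched strings.
preg_match_all($pattern, $source, $matches);
print_r($matches[0]);
```

The dollar sign in "100$ is" gets skipped because it is followed by a space rather than a letter or underscore.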

Opening braces { and closing braces } can be used to define specific repeat counts in three different ways. Firstly, {n}, where n is a positive number, will match n instances of the previous expression. Secondly, {n,} will match a minimum of n instances of the previous expression. Finally, {m,n} will match a minimum of m instances and a maximum of n instances of the previous expression. Note that there are no spaces inside the braces.
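The three brace forms can be tried out with a small loop like the one below. Treat it as a sketch: the patterns and test strings are made up for illustration, and each pattern is wrapped in ^ and $ (both covered later in this chapter) so that the repeat count applies to the whole string rather than to any part of it.

```php
<?php
// Each entry is [pattern, string to test]; all values are invented for this example.
$checks = [
    ['/^[0-9]{3}$/',   '123'],     // {n}: exactly three digits
    ['/^[0-9]{3}$/',   '1234'],    // four digits is one too many
    ['/^[0-9]{2,}$/',  '7'],       // {n,}: needs at least two digits
    ['/^[0-9]{2,}$/',  '777777'],  // six digits satisfies "two or more"
    ['/^[0-9]{2,4}$/', '12345'],   // {m,n}: five digits exceeds the maximum of four
    ['/^[0-9]{2,4}$/', '1234'],    // four digits is within two to four
];

$results = [];
foreach ($checks as [$pattern, $string]) {
    $result = preg_match($pattern, $string);   // 1 for a match, 0 for no match
    $results[] = $result;
    echo $pattern, ' against ', $string, ': ', ($result ? 'match' : 'no match'), "\n";
}
```

Without the ^ and $ anchors, a maximum count has little effect on whether preg_match() succeeds, because the regex is free to match just part of a longer run.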

Here is a list of advanced regular expressions using braces, with the string being tested and whether or not a match is made:

No match; the regex will match precisely three uppercase letters

Match; same as above, but case insensitive this time

Match; precisely three numbers, a dash, then precisely four. This will match local US telephone numbers, for example

No match; must end with one lowercase letter

No match; must start with at least one uppercase letter

No match; start with a maximum of 5 uppercase letters

Finally, we have the dollar $ and caret ^ symbols, which mean "end of line" and "start of line" respectively. Consider the following string:

    $multitest = "This is\na long test\nto see whether\nthe dollar\nSymbol\nand the\ncaret symbol\nwork as planned";

As you know, \n means "new line", so what we have there is a string containing the following text:

This is
a long test
to see whether
the dollar
Symbol
and the
caret symbol
work as planned

By default, ^ and $ only look at the start and end of the whole string. To make them match at the start and end of each line in a multi-line string, we need the "m" modifier, which goes after the final slash. Here is some PHP code - which expressions do you think will match?

    preg_match("/is$/m", $multitest);
    preg_match("/the$/m", $multitest);
    preg_match("/^the/m", $multitest);
    preg_match("/^Symbol/m", $multitest);
    preg_match("/^[A-Z][a-z]{1,}/m", $multitest);

The answer is "all of them" - they all match. Line one means "return true if 'is' is at the end of a line", line two is "return true if 'the' is at the end of a line", and line three is "return true if 'the' is at the start of a line". Line four is "return true if 'Symbol' is at the start of a line", and line five is "return true if there is a capital letter followed by one or more lowercase letters at the start of a line".
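To see how much difference the "m" modifier makes, you could run the same expression with and without it. The sketch below reuses the $multitest string from above; the other variable names are made up for illustration.

```php
<?php
$multitest = "This is\na long test\nto see whether\nthe dollar\nSymbol\nand the\ncaret symbol\nwork as planned";

// Without "m", ^ matches only at the very start of the whole string,
// and the string starts with "This", so this finds nothing.
$without = preg_match('/^Symbol/', $multitest);

// With "m", ^ also matches straight after each newline.
$with = preg_match('/^Symbol/m', $multitest);

echo "without m: $without, with m: $with\n";

// preg_match_all() collects every line start that has a capital letter
// followed by one or more lowercase letters.
preg_match_all('/^[A-Z][a-z]{1,}/m', $multitest, $matches);
print_r($matches[0]);
```

The last call finds "This" and "Symbol", the only two lines that begin with a capital letter.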

As you can see, matching the beginning and end of a line is simple with the $ and ^ characters, but when combined with +, *, ?, and { }, your regular expression-matching ability should rocket upwards.

However, we're not finished yet, grasshopper - if you wish to attain regex nirvana, you need to understand the last few secrets of regex wisdom...


Copyright ©2015 Paul Hudson. Follow me: @twostraws.