Advanced regexes

If you are confused already, it is probably best that you re-read the last section before continuing - the expressions only get more complicated!

We have gone through basic and novice regexes - now we're onto the powerful stuff. The characters +, *, ?, { }, $, and ^ all have special meaning when used outside of sets.

The first four control how many times the previous expression should match, and the last two control where in the string a match may occur. + means "match one or more of the previous expression", * means "match zero or more of the previous expression", and ? means "match zero or one of the previous expression".

Here are some examples:

<?php
    preg_match("/[A-Za-z]*/", $string);
    preg_match("/-?[0-9]+/", $string);
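    // Single quotes here: inside double quotes PHP would turn \$ into a plain $ before the regex engine sees it
    preg_match('/\$[A-Za-z_][A-Za-z_0-9]*/', $string);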
?>

The first expression will match "", "a", "aaaa", "The sun has got his hat on", and in fact any other string - the expression can be translated as "match zero or more uppercase and lowercase letters", and zero letters can be found anywhere. The second regex will match 1, 100, 324343995, and also -1, -100, -234011, etc - the "-?" means "match exactly zero or one minus symbols".

The last regex is fairly complicated, but, as always with regexes, complexity == power. As mentioned before, $ is a regex symbol in its own right, however here we precede it with a backslash, which, unsurprisingly, works as an escape character turning the $ into a standard character and not a regex symbol. We then match precisely one symbol from the range A-Z, a-z, and _, then match zero or more symbols from the range A-Z, a-z, underscore, and 0-9. What kind of text would that match? Here are some examples: $A, $B, $C, $foo, $bar, $Test99, $_MyTest, $__Foo__. Look familiar? That's right - that regex will match PHP variables.
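As a rough sketch, here are the three expressions run against a single sample string. The sample text and the variable names ($word, $number, $variable) are made up for illustration; the third argument to preg_match() receives the text that was actually matched.

```php
<?php
    // Sample input invented for this sketch (single quotes keep the $ literal).
    $string = 'Total: -42 stored in $my_var1';

    // * allows zero letters, so this always succeeds; here it finds "Total".
    preg_match('/[A-Za-z]*/', $string, $matches);
    $word = $matches[0];
    echo $word, "\n";      // Total

    // An optional minus sign followed by one or more digits.
    preg_match('/-?[0-9]+/', $string, $matches);
    $number = $matches[0];
    echo $number, "\n";    // -42

    // A literal dollar sign, then a valid PHP variable name.
    preg_match('/\$[A-Za-z_][A-Za-z_0-9]*/', $string, $matches);
    $variable = $matches[0];
    echo $variable, "\n";  // $my_var1
```

Note that each call stops at the first place in the string where the expression can match.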

Opening braces { and closing braces } can be used to define specific repeat counts in three different ways. Firstly, {n}, where n is a positive number, will match n instances of the previous expression. Secondly, {n,} will match a minimum of n instances of the previous expression. Finally, {m,n} will match a minimum of m instances and a maximum of n instances of the previous expression. Note that there are no spaces inside the braces.
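The three brace forms can be sketched as follows. The sample strings and variable names are hypothetical, chosen only to show how each repeat count behaves when the regex is free to match anywhere in the string.

```php
<?php
    // {3}: three digits in a row must appear somewhere in the string.
    $tooFew = preg_match('/[0-9]{3}/', 'ab12cd');   // 0 - only two digits in a row
    $exact  = preg_match('/[0-9]{3}/', 'ab123cd');  // 1

    // {2,}: two or more digits; as many as possible are consumed.
    preg_match('/[0-9]{2,}/', 'id 2024', $matches);
    $atLeast = $matches[0];                         // "2024"

    // {2,3}: between two and three digits; the match stops after three.
    preg_match('/[0-9]{2,3}/', 'id 2024', $matches);
    $between = $matches[0];                         // "202"

    echo $tooFew, ' ', $exact, ' ', $atLeast, ' ', $between, "\n"; // 0 1 2024 202
```

The last call still succeeds even though the string holds four digits in a row: a maximum only limits how much a single match consumes, it does not stop the regex from matching part of a longer run.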

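Here is a list of advanced regular expressions using braces, with the string used to match, and whether or not a match is made:

| Regex | String | Result |
| --- | --- | --- |
| /[A-Z]{3}/ | FuZ | No match; the regex needs three uppercase letters in a row |
| /[A-Z]{3}/i | FuZ | Match; same as above, but case insensitive this time |
| /[0-9]{3}-[0-9]{4}/ | 555-1234 | Match; precisely three digits, a dash, then precisely four digits. This will match local US telephone numbers, for example |
| /[a-z]+[0-9]?[a-z]{1}/ | aaa1 | Match; the digit is optional and the regex is not tied to the end of the string, so it matches "aaa" and ignores the trailing 1 |
| /[A-Z]{1,}99/ | 99 | No match; at least one uppercase letter must come before the 99 |
| /[A-Z]{1,5}99/ | FINGERS99 | Match; the regex is not tied to the start of the string, so it matches "NGERS99" - five uppercase letters followed by 99 |
| /[A-Z]{1,5}[0-9]{2}/i | adams42 | Match |

Note that none of these regexes say where in the string the match must start or end, so a match can be found anywhere inside it. To make "aaa1" and "FINGERS99" fail, the regex has to be tied to the start and end of the string, which is what the next two symbols are for.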

Finally, we have the dollar $ and caret ^ symbols, which mean "end of line" and "start of line" respectively. Consider the following string:

$multitest = "This is\na long test\nto see whether\nthe dollar\nSymbol\nand the\ncaret symbol\nwork as planned";

As you know, \n means "new line", so what we have there is a string containing the following text:

This is
a long test
to see whether
the dollar
Symbol
and the
caret symbol
work as planned

In order to parse multi-line strings correctly, we need the "m" modifier, so "m" needs to go after the final slash. Here is some PHP code - which expressions do you think will match?

<?php
    preg_match("/is$/m", $multitest);
    preg_match("/the$/m", $multitest);
    preg_match("/^the/m", $multitest);
    preg_match("/^Symbol/m", $multitest);
    preg_match("/^[A-Z][a-z]{1,}/m", $multitest);
?>

The answer is "all of them" - they all match. Line one means "return true if 'is' is at the end of a line", line two is "return true if 'the' is at the end of a line", and line three is "return true if 'the' is at the start of a line". Line four is "return true if 'Symbol' is at the start of a line", and line five is "return true if there is a capital letter followed by one or more lowercase letters at the start of a line".
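To see what the "m" modifier actually changes, here is a small sketch that runs the same expression with and without it. The variable names are made up for this example, and preg_match_all() is used as an assumption-free way to collect every match rather than just the first one.

```php
<?php
    $multitest = "This is\na long test\nto see whether\nthe dollar\nSymbol\nand the\ncaret symbol\nwork as planned";

    // Without "m", ^ only matches at the very start of the whole string,
    // and the string starts with "This", not "the".
    $withoutM = preg_match('/^the/', $multitest);   // 0

    // With "m", ^ also matches at the start of every line.
    $withM = preg_match('/^the/m', $multitest);     // 1

    // Collect every line that starts with a capital letter followed by lowercase letters.
    $count = preg_match_all('/^[A-Z][a-z]{1,}/m', $multitest, $matches);

    echo $withoutM, ' ', $withM, ' ', $count, "\n"; // 0 1 2
    echo implode(', ', $matches[0]), "\n";          // This, Symbol
```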

As you can see, matching the beginning and end of a line is simple with the $ and ^ characters, but when combined with +, *, ?, and { }, your regular expression-matching ability should rocket upwards.

However, we're not finished yet, grasshopper - if you wish to attain regex nirvana, you need to understand the last few secrets of regex wisdom...

 


Copyright ©2015 Paul Hudson. Follow me: @twostraws.