Posts Tagged regular expressions (regexp)

ColdFusion REMatchAll

This ColdFusion method offers functionality similar to PHP’s preg_match_all function. It searches for the supplied regular expression in the supplied text. The return value is an array with one entry for each time the pattern matches the string. The array entries are structs with a numbered element for each parenthesized sub-expression within the match, and a zero-entry for the whole match.

It’s probably easier to see example data.


(Screenshot of the dump result from a REMatchAll call.)

As you can see, it returns every match, position, and full text of the match, as well as each parenthesized subexpression. The example pattern basically matches {foo|bar|baz}, {foo|bar}, or {foo}, and returns the alphabetical sub-components as the sub-expressions.
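For readers more at home in JavaScript, the return shape described above can be sketched roughly as follows. This is not the ColdFusion implementation: the helper name `reMatchAll`, the `position` key, and the sample pattern are assumptions made for illustration, chosen only to mirror the description (one entry per match, a zero-entry for the whole match, and a numbered entry per parenthesized sub-expression).

```javascript
// Hypothetical helper: collect every match of `pattern` in `text`.
function reMatchAll(pattern, text) {
  // matchAll requires the global flag, so add it if the caller left it off.
  const flags = pattern.flags.includes("g") ? pattern.flags : pattern.flags + "g";
  const re = new RegExp(pattern.source, flags);
  const results = [];
  for (const m of text.matchAll(re)) {
    // `position` is 0-based here; ColdFusion string positions are 1-based.
    const entry = { position: m.index };
    for (let i = 0; i < m.length; i++) {
      // Key 0 holds the whole match, keys 1..n the sub-expressions.
      // Groups that did not take part in the match become empty strings.
      entry[i] = m[i] === undefined ? "" : m[i];
    }
    results.push(entry);
  }
  return results;
}

// Assumed pattern in the spirit of the example: {foo}, {foo|bar}, {foo|bar|baz}.
const samplePattern = /\{([a-z]+)(?:\|([a-z]+))?(?:\|([a-z]+))?\}/;
const sample = reMatchAll(samplePattern, "a {foo|bar|baz} b {foo|bar} c {foo}");
console.log(sample);
```

Filling unmatched groups with empty strings keeps every entry the same shape, which makes the result easier to loop over.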

Here is the code.


Regular Expression for Validating Email Addresses

This is the regular expression I use to validate email addresses:


Thought it might be useful to some folks. Most email validation regular expressions fail to allow all the valid characters before the @ sign (for example, you can have a +, an &, slashes, a single quote, =, ?, ^, _, {, }, ~, and *).
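To illustrate the point about the part before the @ sign, here is a minimal JavaScript sketch. The pattern below is not the expression from this post; it is a simplified, hypothetical one that accepts the characters listed above in the local part, and the function name `isPlausibleEmail` is likewise an assumption.

```javascript
// Characters allowed in an unquoted local part (the "atext" set from RFC 5322).
const atext = "[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]+";

// Simplified domain: dot-separated labels ending in an alphabetic TLD.
const domain = "(?:[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?\\.)+[A-Za-z]{2,}";

// Local part is one or more atext runs separated by single dots.
const emailPattern = new RegExp("^" + atext + "(?:\\." + atext + ")*@" + domain + "$");

// Hypothetical helper: true when the address fits the simplified pattern.
function isPlausibleEmail(address) {
  return emailPattern.test(address);
}

console.log(isPlausibleEmail("user+tag@example.com"));
console.log(isPlausibleEmail("o'brien@example.co.uk"));
console.log(isPlausibleEmail("user@example"));
```

This sketch deliberately ignores quoted local parts and IP-literal domains, so it is looser than some validators and stricter than the full address grammar.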

In ColdFusion, you can test an address with:

In JavaScript, you can test with:

In PHP, you can test with:


Book Review: Mastering Regular Expressions, 2nd Ed

Mastering Regular Expressions, 2nd Edition
Author: Jeffrey E. F. Friedl
Publisher: O’Reilly
ISBN: 0-596-00289-0
Pages: 432

This is one of those books that absolutely every developer, in almost any language, should own. If you were to take every technical book away from me but one, this is the one I would choose to keep.

When I picked up this book a few years back in Borders, I figured I’d glance through it and see if it had any syntax tables. I felt pretty confident in my regular expression skills and figured I knew most of what there is to know, but I sometimes stumbled on syntax. More importantly, I periodically encountered a bizarre construct in someone else’s regular expression, and these things are incredibly difficult to Google. Have you ever searched for (?<!? It doesn’t work out so well.

What I found inside when I first opened it was a well-explained, easy to follow, and fairly in-depth discussion of various regular expression engine types, and the relative strengths and weaknesses of each. Digging further, I found that Friedl went into substantial depth on each engine type, giving examples of the sorts of regular expression which would trip it up, and explaining the performance of that regular expression in this engine compared to that engine. This was Chapter 2.

So in reality, are you going to need to know that sort of detail on a day-to-day basis while working with regular expressions? It’s not very likely. You’ll test your regular expression in the engine available to you, discover that it’s fast or that it’s slow, and tune it accordingly. Usually you don’t get the chance to choose which regexp engine you’re going to use. However, it demonstrates the absolutely astounding level of knowledge and detail that Friedl gets into with this book.

This sort of background knowledge helps assimilate the concepts he communicates in later chapters, though he’s such an excellent communicator that you can easily understand what he says in later chapters, even if you don’t understand the background of why it is so.

This is the first technology book I ever sat and just read. I can’t profess to have retained all or maybe even most of what I read (the information is simply too dense), but it fundamentally changed my understanding of regular expressions.

Snippet

Lookahead (?=•••), (?!•••); Lookbehind, (?<=•••), (?<!•••)
Lookahead and lookbehind constructs (collectively, lookaround) are discussed with an extended example in the previous chapter’s “Adding Commas to a Number with Lookaround” (p 59). One important issue not discussed there relates to what kind of expression can appear within either of the lookbehind constructs. Most implementations have restrictions about the length of the text matchable within lookbehind (but not within lookahead, which is unrestricted).

The most restrictive rule exists in Perl and Python, where the lookbehind can match only fixed-length strings. For example, (?<!\w) and (?<!this|that) are allowed, but (?<!books?) and (?<!^\w+:) are not, as they can match a variable amount of text. In some cases, such as with (?<!books?), you can accomplish the same thing by rewriting the expression, as with (?<!book)(?<!books), although that’s certainly not easy to read at first glance.

The next level of support allows alternatives of different lengths within the lookbehind, so (?<!books?) can be written as (?<!book|books). PCRE (and the pcre routines in PHP) allow this.

The next level allows for regular expressions that match a variable amount of text, but only if it’s of a finite length. This allows (?<!books?) directly, but still disallows (?<!^\w+:) since the \w+ is open-ended. Sun’s Java regex package supports this level.

When it comes down to it, these first three levels of support are really equivalent, since they can all be expressed, although perhaps somewhat clumsily, with the most restrictive fixed-length matching level of support. The intermediate levels are just “syntactic sugar” to allow you to express the same thing in a more pleasing way. The fourth level, however, allows the subexpression within lookbehind to match any amount of text, including the (?<!^\w+:) example. This level, supported by Microsoft’s .NET languages, is truly superior to the others, but does carry a potentially huge efficiency penalty if used unwisely. (When faced with lookbehind that can match any amount of text, the engine is forced to check the look-behind subexpression from the start of the string, which may mean a lot of wasted effort when requested from near the end of a long string.)
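As a quick check of the rewriting trick from the excerpt, here is a small sketch in JavaScript, whose engines have accepted variable-length lookbehind since ES2018. The sample text and the names used are assumptions for illustration; the sketch only compares (?<!books?) with the fixed-length rewrite (?<!book)(?<!books).

```javascript
// Sample text (an assumption): three labels, each followed by a colon.
const text = "books: 3, book: 1, note: 7";

// Variable-length form: a colon not preceded by "book" or "books".
const variableLength = /(?<!books?):/g;

// Fixed-length rewrite from the excerpt: two lookbehinds, each of fixed width.
const fixedLength = /(?<!book)(?<!books):/g;

// Collect the index of every match for a given pattern.
function matchIndexes(re, input) {
  return Array.from(input.matchAll(re), (m) => m.index);
}

const variableHits = matchIndexes(variableLength, text);
const fixedHits = matchIndexes(fixedLength, text);

// Both forms should pick out only the colon after "note".
console.log(variableHits, fixedHits);
```

On an engine limited to fixed-length lookbehind, only the second pattern would compile, which is the trade-off the excerpt describes.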
