Book Review: Mastering Regular Expressions, 2nd Ed


Mastering Regular Expressions, 2nd Edition
Author: Jeffrey E. F. Friedl
Publisher: O’Reilly
ISBN: 0-596-00289-0
Pages: 432

This is one of those books that nearly every developer, in almost any language, should own. If you were to take every technical book away from me but one, this is the one I would choose to keep.

When I picked up this book a few years back in Borders, I figured I’d glance through it and see if it had any syntax tables. I felt pretty confident in my regular expression skills and figured I knew most of what there is to know, but I sometimes stumbled on syntax. More importantly, I periodically encountered a bizarre construct in someone else’s regular expression, and these things are incredibly difficult to Google. Have you ever searched for (?<!? It doesn’t work out so well.

What I found inside when I first opened it was a well-explained, easy-to-follow, and fairly in-depth discussion of the various regular expression engine types and the relative strengths and weaknesses of each. Digging further, I found that Friedl went into substantial depth on each engine type, giving examples of the sorts of regular expressions that would trip it up, and explaining how a given expression performs in one engine compared to another. This was Chapter 2.

So in reality, are you going to need to know that sort of detail on a day-to-day basis while working with regular expressions? It’s not very likely. You’ll test your regular expression in the engine available to you, discover that it’s fast or that it’s slow, and tune it accordingly. Usually you don’t get the chance to choose which regexp engine you’re going to use. However, it demonstrates the absolutely astounding level of knowledge and detail that Friedl gets into with this book.

This sort of background knowledge helps assimilate the concepts he communicates in later chapters, though he’s such an excellent communicator that you can easily understand what he says in later chapters, even if you don’t understand the background of why it is so.

This is the first technology book I ever sat and just read. I can’t profess to have retained all, or maybe even most, of what I read; the information is simply too dense. But it fundamentally changed my understanding of regular expressions.

Snippet

Lookahead (?=•••), (?!•••); Lookbehind (?<=•••), (?<!•••)
Lookahead and lookbehind constructs (collectively, lookaround) are discussed with an extended example in the previous chapter’s “Adding Commas to a Number with Lookaround” (p 59). One important issue not discussed there relates to what kind of expression can appear within either of the lookbehind constructs. Most implementations have restrictions about the length of the text matchable within lookbehind (but not within lookahead, which is unrestricted).

The most restrictive rule exists in Perl and Python, where the lookbehind can match only fixed-length strings. For example, (?<!\w) and (?<!this|that) are allowed, but (?<!books?) and (?<!^\w+:) are not, as they can match a variable amount of text. In some cases, such as with (?<!books?), you can accomplish the same thing by rewriting the expression, as with (?<!book)(?<!books), although that’s certainly not easy to read at first glance.

The next level of support allows alternatives of different lengths within the lookbehind, so (?<!books?) can be written as (?<!book|books). PCRE (and the pcre routines in PHP) allow this.

The next level allows for regular expressions that match a variable amount of text, but only if it’s of a finite length. This allows (?<!books?) directly, but still disallows (?<!^\w+:) since the \w+ is open-ended. Sun’s Java regex package supports this level.

When it comes down to it, these first three levels of support are really equivalent, since they can all be expressed, although perhaps somewhat clumsily, with the most restrictive fixed-length matching level of support. The intermediate levels are just “syntactic sugar” to allow you to express the same thing in a more pleasing way. The fourth level, however, allows the subexpression within lookbehind to match any amount of text, including the (?<!^\w+:) example. This level, supported by Microsoft’s .NET languages, is truly superior to the others, but does carry a potentially huge efficiency penalty if used unwisely. (When faced with lookbehind that can match any amount of text, the engine is forced to check the look-behind subexpression from the start of the string, which may mean a lot of wasted effort when requested from near the end of a long string.)
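The fixed-length rule in the excerpt is easy to try for yourself. Here is a minimal sketch, assuming Python's standard re module, which still rejects variable-length lookbehind at compile time. The helper name compiles and the trailing "cat" and "shelf" literals are my own, made up for illustration; the lookbehind expressions are the ones from the excerpt.

```python
import re

def compiles(pattern):
    """Return True if Python's re module accepts the pattern (hypothetical helper)."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# Fixed-length lookbehind is accepted.
print(compiles(r"(?<!\w)cat"))
print(compiles(r"(?<!this|that)cat"))   # both alternatives are four characters

# Anything that can match a variable amount of text is rejected.
print(compiles(r"(?<!books?)cat"))
print(compiles(r"(?<!book|books)cat"))  # alternatives of different lengths
print(compiles(r"(?<!^\w+:)cat"))

# The rewrite from the excerpt: two fixed-length lookbehinds in a row.
shelf = re.compile(r"(?<!book)(?<!books)shelf")
print(bool(shelf.search("top shelf")))
print(bool(shelf.search("bookshelf")))
print(bool(shelf.search("booksshelf")))
```

The first two patterns compile and the next three do not, while the rewritten expression finds "shelf" only when neither "book" nor "books" comes directly before it.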


  1. #1 by Brian at March 27th, 2008

    This is the first technology book you ever sat down and read?

    What is the true value of mastering regular expressions? I don’t doubt that this book is a good reference source, but placing this book above all other tech books may be a little much.

    What is in this book that I can’t find on Google to get the same job done?

  2. #2 by eric stevens at March 27th, 2008

    Not the first technology book I’ve read, but the first one I’ve sat and read entire chapters of (normally I’ll skim through one, reading source code, and the details where the source code isn’t self-explanatory).

    I can’t say it’s the best technology book anywhere, ever, but it’s the best one I own.

    It gives you information which you’re going to have a hard time finding on Google (for example, being able to look up certain constructs – regexp constructs are not searchable). But it also gets into the level of detail I’ve never encountered natively on the web. For example:
    [Which engine type is being used,] DFA or POSIX NFA?
    Differentiating between a POSIX NFA and a DFA is sometimes just as simple. Capturing parentheses and backreferences are not supported by a DFA, so that can be one hint, but there are systems that are a hybrid mix between the two engine types, and so may end up using a DFA if there are no capturing parentheses.

    Here’s a simple test that can tell you a lot. Apply X(.+)+X to a string like “==XX=========================”

    If it takes a long time to finish, it’s an NFA (and if not a Traditional NFA as per the test in the previous section, it must be a POSIX NFA). If it finishes quickly, it’s either a DFA or an NFA with some advanced optimization. Does it display a warning message about a stack overflow or long match aborted? If so, it’s an NFA.

    OK, so how often do you really need to know the engine type? Not that often, it turns out, but I use this as an example of the sort of amazing and intricate detail he gets into. If there’s a question about regular expressions this book fails to answer, then it’s not too likely to appear on Google either.
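The X(.+)+X test quoted above can be run as a timing probe. This is a rough sketch, assuming Python's standard re module (a backtracking engine); the function name time_probe and the tail lengths are my own choices, not the book's. The tail is kept short on purpose, because the work roughly doubles with every extra "=" on an engine that backtracks.

```python
import re
import time

# The probe expression from the quoted test.
PATTERN = re.compile(r"X(.+)+X")

def time_probe(tail_length):
    """Apply the probe to '==XX' plus tail_length '=' signs; return (match, seconds)."""
    subject = "==XX" + "=" * tail_length
    start = time.perf_counter()
    result = PATTERN.search(subject)   # cannot succeed: nothing follows the second X except '='
    elapsed = time.perf_counter() - start
    return result, elapsed

# Keep the tails short: each extra character roughly doubles the backtracking work.
for n in (8, 12, 16):
    result, elapsed = time_probe(n)
    print(f"tail={n:2d}  match={result}  {elapsed:.4f}s")
```

If the times climb steeply as the tail grows, you are looking at an NFA by the test's logic; if they stay flat, it is a DFA or an NFA with some optimization for this case.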

  3. #3 by eric stevens at March 27th, 2008

    Maybe one of the reasons I believe it’s the greatest technology book I own is that it transcends languages. Almost no matter what computer language you use, regular expressions are available to you, and this book can help you optimize and tune your regular expressions, making some very complicated tasks very easy.

    It also describes in fantastic detail many constructs I didn’t even know existed until I read this book.

  4. #4 by Brian at March 27th, 2008

    I don’t know. When I have to use regex, then I use it. As far as the engine and all the detail, I’m sure it’s fascinating to some. For me, it’s just way too much information.

    I’m not trying to downgrade your review, or the book. I just think for John Q. Developer it is a bit much.

    Someday you might prove me wrong and I’ll need to either purchase my own copy or beg to borrow yours.
