Regular Expression Syntax

Creating Powerful Regular Expressions

Regular expressions can look cryptic and sometimes outright unreadable. They may seem difficult at first, but that terse syntax is exactly what lets you create powerful regexen quickly and in a compact form.

Regular expressions are composed of two types of things: elements and operators. There are many operators, but the ones known as quantifiers are the most commonly used.

Elements

An element is a character or sequence of characters to be matched. This is the meat of the regexp, the part that defines exactly which characters to look for. Here are a few examples that use only elements and show off several types of them.

  • /foo/ - All three letters in this regex are individual elements. The regex as a whole will match the sequence "foo" wherever it appears in a string, whether as a word on its own or inside a longer word.

  • /f[aeiou]n/ - The character class in the brackets specifies an entire set of characters that will match. The three separate elements in this regexp (the f and n characters, plus the character class) will match fan, fen, fin, fon, or fun.

  • /fun\s/ - The \s is shorthand for a predefined character class. In current Ruby versions it is equivalent to [ \t\r\n\f\v], or, in other words, the ASCII whitespace characters. The regexp as a whole will match "fun" followed by one whitespace character. Note that in Ruby 1.9 and later these shorthand character classes match only ASCII characters, whatever your locale; the POSIX bracket classes such as [[:space:]] are the Unicode-aware alternatives.

  • /f../ - The dot element will match any character except a newline. This regex will match "foo", "far", "f!!" or any sequence of characters that's f followed by two more characters.
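To try the element-only patterns above yourself, here is a minimal sketch. The constant names and the test strings are chosen purely for illustration, and Regexp#match? assumes Ruby 2.4 or later.

```ruby
# Each pattern below uses only elements (no quantifiers).
# The constant names are illustrative, not part of any library.
LITERAL   = /foo/        # three literal characters
VOWEL_SET = /f[aeiou]n/  # f, one vowel from the class, then n
TRAILING  = /fun\s/      # "fun" followed by one whitespace character
ANY_TWO   = /f../        # f followed by any two non-newline characters

# Regexp#match? returns true if the pattern matches anywhere in the string.
puts LITERAL.match?("seafood")     # true: "foo" appears inside the word
puts VOWEL_SET.match?("fin")       # true
puts VOWEL_SET.match?("fxn")       # false: x is not in the class
puts TRAILING.match?("fun times")  # true: a space follows "fun"
puts TRAILING.match?("fun")        # false: nothing follows "fun"
puts ANY_TWO.match?("f!!")         # true
puts ANY_TWO.match?("fo")          # false: only one character after f
```

The false results are as instructive as the true ones: every element has to match exactly one character, so a string that comes up one character short does not match.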

Quantifiers

A quantifier defines how many times the previous element may match. Think of every element as having an invisible quantifier that says exactly one of that element should match. There are a number of quantifiers available for use.

  • /fo*d/ - The asterisk quantifier means "zero or more" of the previous element. In this example, zero or more o characters will match. The regexp as a whole will match not only the word "food" but "fooooood", "fod" and "fd".

  • /fo+d/ - The plus quantifier is a lot like the asterisk quantifier, except it means "one or more." While /fo*d/ will match "fd", /fo+d/ will not, as the plus quantifier requires at least one of the previous element to match.

  • /fo?d/ - The question mark quantifier means "zero or one" or, put more plainly, "optional." The o in the regex is optional; it may or may not be there. This regex will match both "fd" and "fod".
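The three quantifiers are easiest to compare side by side. The following is a small sketch that runs each one against the same list of strings; the constant names and the candidate list are made up for this illustration.

```ruby
# Candidate strings with zero, one, two and many o characters.
CANDIDATES = %w[fd fod food fooooood]

ZERO_OR_MORE = /fo*d/  # asterisk: any number of o characters, including none
ONE_OR_MORE  = /fo+d/  # plus: at least one o
ZERO_OR_ONE  = /fo?d/  # question mark: at most one o

# Array#grep keeps only the strings that the pattern matches.
p CANDIDATES.grep(ZERO_OR_MORE)  # => ["fd", "fod", "food", "fooooood"]
p CANDIDATES.grep(ONE_OR_MORE)   # => ["fod", "food", "fooooood"]
p CANDIDATES.grep(ZERO_OR_ONE)   # => ["fd", "fod"]
```

Note that /fo?d/ rejects "food": the quantifier allows at most one o between the f and the d, and "food" has two.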

Something Practical

Using only what was taught above, you can now begin to more effectively use the methods in Ruby that support Regexp objects. Here are a few practical examples.

  • "Walk teh dog".gsub( /teh/, "the" ) - Imagine you make frequent typos. You can use this example to replace every occurrence of "teh" with "the".

  • "Extra   spaces   are bad".gsub( / +/, " " ) - If your space bar sticks, you'll need to replace multiple spaces with single spaces. Note the space before the plus character in the regexp.

  • "These are words".scan( /\w+/ ) {|w| puts w} - Here you can iterate over every word in a string. The \w character class means "word character." In Ruby 1.9 and later that is any ASCII letter, digit or underscore, regardless of what language your computer is set to use. To match letters from other alphabets, use a POSIX bracket class such as [[:alpha:]] instead.
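The one-liners above can be collected into a short script to run and experiment with. This is only a sketch: the method names fix_teh, squeeze_spaces and words_in are hypothetical helpers invented for the example, while gsub and scan are the standard String methods.

```ruby
# Hypothetical helper names; only gsub and scan come from Ruby itself.

# Replace every occurrence of the typo "teh" with "the".
def fix_teh(text)
  text.gsub(/teh/, "the")
end

# Collapse each run of one or more spaces into a single space.
def squeeze_spaces(text)
  text.gsub(/ +/, " ")
end

# Without a block, scan returns an array of every match.
def words_in(text)
  text.scan(/\w+/)
end

puts fix_teh("Walk teh dog")                    # Walk the dog
puts squeeze_spaces("Extra   spaces  are bad")  # Extra spaces are bad
p words_in("These are words")                   # ["These", "are", "words"]
```

Keep in mind that /teh/ matches those three letters anywhere, so it would also rewrite a longer word that happens to contain them.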


©2014 About.com. All rights reserved.