Parsing with Regular Expressions

Not Quite, but Almost a Full Parser

This article is part of a series on exploring evented programming by building a distributed IRC bot.

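Starting with Ruby 1.9, Ruby supports named Regexp groups. A "group" in a regular expression is the part of the pattern enclosed in parentheses; whatever text that part matches is captured. For example, matching test123 with /[a-z]+([0-9]+)/ matches the entire string, and the number at the end is stored in the first capture group, since it's matched by the part of the pattern inside the parentheses. (Note that /\w+([0-9]+)/ would not behave the same way: \w also matches digits and + is greedy, so the group would capture only the final 3.)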


2.0.0-p247 :001 > mat_dat = "test123".match(/[a-z]+([0-9]+)/)
 => #<MatchData "test123" 1:"123">
2.0.0-p247 :002 > mat_dat[1]
 => "123"
2.0.0-p247 :003 >

This works, and can be used in a number of ways, but Ruby also lets you give each capture group a name. This goes a long way toward keeping your regular expressions readable (something that is most certainly an issue!). To name a group, use the (?<name>...) sequence. From then on, you can refer to the capture group by name instead of by a magic number (such as 1 in the previous example).


2.0.0-p247 :001 > mat_dat = "test123".match(/[a-z]+(?<number>[0-9]+)/)
 => #<MatchData "test123" number:"123">
2.0.0-p247 :002 > mat_dat['number']
 => "123"
2.0.0-p247 :003 >
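
Named groups can be read back in more than one way. The following is a minimal sketch, using a made-up date string and group names (year, month, day) chosen purely for illustration, that shows lookup by string, lookup by symbol, and the names method on MatchData.

```ruby
# Hypothetical input and group names, chosen only for illustration.
date_regexp = /(?<year>[0-9]{4})-(?<month>[0-9]{2})-(?<day>[0-9]{2})/

mat_dat = "released 2013-06-27".match(date_regexp)

# A named group can be looked up by string or by symbol.
year  = mat_dat['year']
month = mat_dat[:month]

# MatchData#names lists the group names in the order they appear.
group_names = mat_dat.names

puts "#{year}/#{month} (groups: #{group_names.join(', ')})"
```

Whether you use strings or symbols is a matter of taste; both return the same captured text.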

Recalling Groups and Extended Regular Expressions

Named groups are useful, but recalling a named group is even more useful. This allows you to create regular expression "functions" that can be called to abstract and generalize your regular expressions. In addition to this, we'll also start using extended regular expressions, which are enabled with the x flag. Extended regular expressions allow you to spread a pattern over more than one line. Unescaped whitespace in the pattern is ignored, as are comments starting with #, and this goes a long way toward getting rid of the line noise that regular expressions have traditionally been.

First, a simple example showing extended regular expressions.


#!/usr/bin/env ruby

ip_regexp = /
  \d+
  (\.\d+){3}
/x

puts "Argument is an IP address" if ARGV[0].match(ip_regexp)

This is a simple tool that tells you whether a string looks like an IP address. The extended regular expression makes it very clear that we're looking for a sequence of digits followed by three sequences of a dot and more digits. (It doesn't check that each number is in the 0-255 range, and since the pattern isn't anchored it will also match an address embedded in a longer string.) While this was a simple example, just trust me when I say regular expressions get hairy real fast. Keeping them clean with features like this is always worthwhile.
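
Extended mode also allows comments inside the pattern, which is where much of the readability gain comes from. The sketch below is a variation on the script above, not a replacement for it: the \A and \z anchors, the constant name, and the looks_like_ip? helper are additions made here for illustration.

```ruby
# A commented variant of the pattern above.
# The anchors and all names here are assumptions for illustration.
ANCHORED_IP_REGEXP = /
  \A            # start of the string
  \d+           # first run of digits
  (\.\d+){3}    # three more runs, each preceded by a dot
  \z            # end of the string
/x

# Returns true when the whole string has the digits.digits.digits.digits shape.
def looks_like_ip?(string)
  !string.match(ANCHORED_IP_REGEXP).nil?
end

["192.168.0.1", "host 192.168.0.1 port", "1.2.3"].each do |candidate|
  puts "#{candidate.inspect}: #{looks_like_ip?(candidate)}"
end
```

With the anchors in place, an address embedded in a longer string no longer matches, though the pattern still does not check that each number is a sensible value. Be careful with comments inside a regular expression literal: a / in a comment still ends the literal.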

But let's add some "regular expression functions" using named capture groups. The pattern inside a named capture group can be applied again using the \g<name> sequence. If you look at the previous example, it's not DRY. We wrote out what we mean by "a number" twice. So let's dry that up using a named capture group.


#!/usr/bin/env ruby

ip_regexp = /
  (?<number>\d+){0}

  \g<number>
  (\.\g<number>){3}
/x

puts "Argument is an IP address" if ARGV[0].match(ip_regexp)

Not all that much has changed. The first line of the regular expression defines the named capture group "number," but there's something odd at the end of the line. A {n} sequence is a quantifier: it says how many times the preceding element should match. In the last line, we want 3 matches, but in the first line we want zero matches? Strangely enough, yes. We don't actually want to match any numbers yet, we just want to get the named capture group defined. Later on we recall it with the \g<number> sequence to actually do the matching.
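
The payoff of the {0} definition trick grows as the "function" gets more complicated. As a rough sketch under assumptions of my own (the octet group, the constant name, the anchors, and the valid_ip? helper are all hypothetical and not part of the script above), here is the same structure with a definition that only accepts numbers from 0 to 255:

```ruby
# Hypothetical stricter variant: "octet" is defined once with {0}
# and then called four times with \g<octet>.
OCTET_IP_REGEXP = /
  (?<octet> 25[0-5] | 2[0-4]\d | 1\d\d | [1-9]?\d ){0}

  \A
  \g<octet>
  (\.\g<octet>){3}
  \z
/x

# Returns true when the string is four dot-separated numbers in 0..255.
def valid_ip?(string)
  !string.match(OCTET_IP_REGEXP).nil?
end

["192.168.0.1", "256.1.1.1", "1.2.3"].each do |candidate|
  puts "#{candidate.inspect}: #{valid_ip?(candidate)}"
end
```

Writing the octet alternation out four times would be hard to read and easy to get wrong; defining it once and calling it keeps the overall shape of the pattern visible.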

If you'd like to continue reading, the next article in this series is IRC Messages.

©2014 About.com. All rights reserved.