Parsing using Named Capture Groups

Parsing IRC Messages

This article is part of a series on exploring evented programming by building a distributed IRC bot.

In the article on Named Regexp Groups, we looked at a novel way to parse text using regular expressions without it turning into an unreadable mess. We'll extend that to parse any IRC message in a readable and rather pleasant way.

The previous article on IRC Messages gave an overview of IRC messages and showed you how to connect to an IRC server using Telnet or netcat to get an IRC log. You'll need some kind of IRC log to play with at first, so either create one yourself or download my log.

The RFC, Don't Use It

In this case, it's fine to ignore the RFC, because everyone else does. An RFC (Request for Comments) typically specifies the exact format of a protocol, down to the bit, and a client that goes against it will usually get bitten at some point. However, it seems no one actually follows this one. I had written a much more detailed parser, but it turned out to be useless because the exact character sets outlined in the RFC are simply not used. Real servers are much more lax.

BNF

"Real" parsers are defined using Backus-Naur Form (BNF). A BNF grammar defines a message in terms of its conceptual parts, such as "An IRC message is a prefix, a command and a list of parameters." It then goes on to define each sub-part until it gets down to individual characters, often descending through many optional paths. For example, a prefix can be either a nick, user and host formatted like nick!user@host, or just a hostname. The BNF for the IRC protocol can be found in the RFC, but as I said, we're not following it this time around.

So why even mention BNF? Because what we're doing with named groups in regular expressions is very close to a BNF, only backwards. Because named groups can only be recalled after they've been defined (unlike Ruby methods, which can be called from any part of the code no matter which one comes first in the file), we have to write an inside-out BNF. We start with the most basic things and build up to a complete IRC message.
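As a small illustration of that inside-out style, here is a minimal sketch that has nothing to do with IRC yet. The names PAIR_REGEXP, key, value and pair are made up for this example. Each group is defined with {0} so it matches nothing where it is written, and is then recalled with \g<name> further down.

```ruby
# Hypothetical example: a "key=value" grammar written smallest parts first.
PAIR_REGEXP = /
  (?<key>[a-z]+){0}
  (?<value>[0-9]+){0}
  (?<pair>\g<key>=\g<value>){0}

  ^\g<pair>$
/x

md = "port=6667".match(PAIR_REGEXP)
puts md[:key]    # prints port
puts md[:value]  # prints 6667
puts md[:pair]   # prints port=6667
```

The last line of the expression is the only part that actually matches anything. Everything above it is a definition.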

The objective here is to extract as much information from the IRC message as possible. Ideally, we should be able to simply throw IRC messages at the regular expression and have each individual part of the message available in the match data. For example, we should be able to say md = ":some.server.com 001 RubyTest :Welcome to the server!".match(IRC_REGEXP). Here md holds our MatchData object, and we should be able to query it for things like md[:command] and get the string "001" back, or md[:prefix] and get ":some.server.com".

Parsing is always difficult, or at least tedious, to get right. You usually end up with some monstrosity of nested conditionals and case statements that's very difficult to write and maintain. Since we're letting the regular expression engine do all the heavy lifting and just telling it what to name things, our job should be rather simple.

The Process

The process I use for defining a regular expression is to start at the most generic level and work backwards through the regular expression. Let me just show you what I start with (this regexp won't work, not quite yet!):


#!/usr/bin/env ruby

# Skeleton only: the three groups are empty placeholders to be filled in later.
IRC_REGEXP = /
  (?<prefix>){0}
  (?<command>){0}
  (?<params>){0}

  ^(\g<prefix>\ )?\g<command>(\ \g<params>)?(\r\n)?$
/x

File.open('log.txt') do |f|
  f.each do |line|
    md = line.match(IRC_REGEXP)
    puts md.inspect
  end
end

The "functions" at the top of the regexp are empty right now. The match data will always be nil, but we'll soon fix that. I'll begin to fill those in starting with the beginning of the line, the prefix. I'll also add in a hack to ignore the unmatched (as of yet) command so the incomplete regexp works.


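#!/usr/bin/env ruby

# nick and user also stop at a space so they cannot run past the prefix.
# command is a catch-all placeholder for now, and params is still empty.
IRC_REGEXP = /
  (?<nick>[^!\ ]+){0}
  (?<user>[^@\ ]+){0}
  (?<host>[^\ ]+){0}

  (?<prefix_host>\g<host>){0}
  (?<prefix_user>\g<nick>(!\g<user>(@\g<host>)?)?){0}

  (?<prefix>:(\g<prefix_user>|\g<prefix_host>)){0}
  (?<command>.+){0}
  (?<params>){0}

  ^(\g<prefix>\ )?(\g<command>)(\g<params>)?(\r\n)?$
/x

File.open('log.txt') do |f|
  f.each do |line|
    md = line.match(IRC_REGEXP)
    next unless md  # skip lines the expression cannot match, such as blank lines

    if md[:nick]
      puts "#{md[:nick]} #{md[:user]} #{md[:host]}"
    else
      # Typically a message with no prefix at all, so this prints a blank line.
      puts md[:host]
    end
    puts line
    puts "----"
    puts
  end
end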

OK, things are starting to get a bit more complex here. Starting with the definition of a prefix: a prefix is a colon followed by either a prefix user or a prefix host. A prefix user is a user in the form nick!user@host, where the user and host parts are optional. Note that the actual definitions for user, nick and host don't follow the RFC, as noted. They only match up to the delimiting characters, to avoid any inconsistencies some servers may present. One side effect is that a bare server name also satisfies the prefix user definition, so it shows up under md[:nick] rather than md[:host]. The loop at the bottom changed a bit. As you can see, after running the match on each line, you can pull out individual parts of the message using md[:nick] or md[:host]. Things are looking up. Now to implement the commands.
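Before moving on, the prefix definitions can be exercised on their own, without a log file. This is only a sketch. PREFIX_REGEXP and describe_prefix are hypothetical names, the sample prefixes are invented, and the nick and user classes are assumed to exclude spaces as well as their delimiters.

```ruby
# Sketch: the prefix groups in isolation, matched against a whole string.
PREFIX_REGEXP = /
  (?<nick>[^!\ ]+){0}
  (?<user>[^@\ ]+){0}
  (?<host>[^\ ]+){0}

  (?<prefix_host>\g<host>){0}
  (?<prefix_user>\g<nick>(!\g<user>(@\g<host>)?)?){0}
  (?<prefix>:(\g<prefix_user>|\g<prefix_host>)){0}

  \A\g<prefix>\z
/x

# Hypothetical helper: returns the named parts as a hash, or nil on no match.
def describe_prefix(prefix)
  md = prefix.match(PREFIX_REGEXP)
  return nil unless md

  { nick: md[:nick], user: md[:user], host: md[:host] }
end

p describe_prefix(":RubyTest!~ruby@example.com")
p describe_prefix(":some.server.com")
```

The first call should yield the nick RubyTest, the user ~ruby and the host example.com. The second shows the ambiguity between the two kinds of prefix: the server name lands in :nick, while :user and :host stay nil.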


#!/usr/bin/env ruby

# command is now real. params is a catch-all placeholder until the next step.
IRC_REGEXP = /
  (?<nick>[^!\ ]+){0}
  (?<user>[^@\ ]+){0}
  (?<host>[^\ ]+){0}

  (?<prefix_host>\g<host>){0}
  (?<prefix_user>\g<nick>(!\g<user>(@\g<host>)?)?){0}

  (?<prefix>:(\g<prefix_user>|\g<prefix_host>)){0}
  (?<command>[0-9A-Z]+){0}
  (?<params>.+){0}

  ^(\g<prefix>\ )?\g<command>(\ \g<params>)?(\r\n)?$
/x

File.open('log.txt') do |f|
  f.each do |line|
    md = line.match(IRC_REGEXP)
    next unless md  # skip lines the expression cannot match

    puts md[:command]
    puts line
    puts "----"
    puts
  end
end

There was no need to make any more "functions" here, as the command part of the message is so simple, and we added in a hack to get the params working for now. Now to implement the params for real. Remember that params can look like param1 param2 :Param with spaces. The objective is to get every part of the message into capture groups, but there is no "array capture." We can't say md[:params] and have it return an array; it will always return a string. This is the only part we'll need to cheat on: we'll manually split the params into their constituent parts after first extracting the trailing. The "trailing" is the parameter on the end that may contain spaces. It's so common that we need special and convenient access to it.
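The manual split described above can be sketched in plain Ruby before any of it goes into the regular expression. split_params is a hypothetical helper, not part of the bot, and it assumes the trailing parameter starts at the first colon that begins the string or follows a space.

```ruby
# Hypothetical helper: returns [array_of_middle_params, trailing_or_nil].
def split_params(raw)
  raw = raw.chomp  # drop a trailing CRLF or LF if present
  if raw.start_with?(':')
    # The whole string is the trailing parameter.
    middle = ''
    trailing = raw[1..-1]
  else
    middle, trailing = raw.split(' :', 2)
  end
  [middle.to_s.split(' '), trailing]
end

middle, trailing = split_params("param1 param2 :Param with spaces")
p middle         # the space-separated parameters, as an array
puts trailing    # the trailing parameter, spaces intact
```

The trailing part is peeled off first because it is the only parameter allowed to contain spaces. Whatever is left can then be split on spaces safely.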


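#!/usr/bin/env ruby

# param is a single parameter that does not start with a colon.
# middle is the whole run of space-separated parameters before the trailing.
IRC_REGEXP = /
  (?<param>[^:\ \r\n][^\ \r\n]*){0}
  (?<middle>\g<param>(\ \g<param>)*){0}
  (?<trailing>[^\r\n]*){0}

  (?<nick>[^!\ ]+){0}
  (?<user>[^@\ ]+){0}
  (?<host>[^\ ]+){0}

  (?<prefix_host>\g<host>){0}
  (?<prefix_user>\g<nick>(!\g<user>(@\g<host>)?)?){0}

  (?<prefix>:(\g<prefix_user>|\g<prefix_host>)){0}
  (?<command>[0-9A-Z]+){0}
  (?<params>(\g<middle>)?(\ ?:\g<trailing>)?){0}

  ^(\g<prefix>\ )?\g<command>(\ \g<params>)?(\r\n)?$
/x

File.open('log.txt') do |f|
  f.each do |line|
    md = line.match(IRC_REGEXP)
    next unless md  # skip lines the expression cannot match

    # There is no array capture, so split the run of parameters by hand.
    if md[:middle]
      puts md[:middle].split(' ').inspect
    end
    if md[:trailing]
      puts md[:trailing]
    end
    puts line
    puts "----"
    puts
  end
end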
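To try the approach without a log file, a condensed variant of the expression can be wrapped in a small method. This is only a sketch under a few assumptions. SKETCH_REGEXP and parse_line are hypothetical names, nicks and users are assumed never to contain spaces, and the space-separated parameters are collected under a group called middle so they can be split by hand.

```ruby
# Sketch: a self-contained variant of the expression plus a small wrapper.
SKETCH_REGEXP = /
  (?<param>[^:\ \r\n][^\ \r\n]*){0}
  (?<middle>\g<param>(\ \g<param>)*){0}
  (?<trailing>[^\r\n]*){0}

  (?<nick>[^!\ ]+){0}
  (?<user>[^@\ ]+){0}
  (?<host>[^\ ]+){0}

  (?<prefix_host>\g<host>){0}
  (?<prefix_user>\g<nick>(!\g<user>(@\g<host>)?)?){0}
  (?<prefix>:(\g<prefix_user>|\g<prefix_host>)){0}
  (?<command>[0-9A-Z]+){0}
  (?<params>(\g<middle>)?(\ ?:\g<trailing>)?){0}

  ^(\g<prefix>\ )?\g<command>(\ \g<params>)?(\r\n)?$
/x

# Hypothetical helper: returns a hash of message parts, or nil on no match.
def parse_line(line)
  md = line.match(SKETCH_REGEXP)
  return nil unless md

  {
    prefix:   md[:prefix],
    command:  md[:command],
    params:   md[:middle].to_s.split(' '),  # empty array when there are none
    trailing: md[:trailing]
  }
end

msg = parse_line(":some.server.com 001 RubyTest :Welcome to the server!\r\n")
puts msg[:prefix]    # prints :some.server.com
puts msg[:command]   # prints 001
puts msg[:trailing]  # prints Welcome to the server!
```

Returning nil for unparseable lines keeps the caller's loop simple, since a single check covers blank lines and anything else that isn't an IRC message.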

And that's pretty much it. Each part is defined in its own little corner, and we only need one hack to get to the parameters (we'll clean that up once we abstract this in a later article). We can even pull out the individual parts.

When I first started this article, I had written a much more complete regular expression. While it does work, and I have tested it more, it's unnecessarily complex and carries the "BNF-like parser in regexp" idea a bit too far. But, for completeness, here it is. It's 32 lines (that's a long regexp!), and I really don't recommend doing it this way, at least for IRC.


module IRC
  class Message
    FORMAT = %q{
      (?<crlf>\x0d\x0a){0}
      (?<letter>[a-zA-Z]){0}
      (?<number>[0-9]){0}
      (?<special>[-~_\[\]\\\`\^\{\}]){0}
      (?<nonwhite>[^\ \x00\x0d\x0a]){0}
      (?<paramchar>[^\x00\x0d\x0a]){0}
      (?<chanchar>[^\ \x07\x00\x0d\x0c,]){0}
      (?<userchar>\g<letter>|\g<number>|\g<special>){0}
      
      (?<trailing>:\g<paramchar>+){0}
      (?<params>\g<paramchar>+){0}
      
      (?<command>[a-zA-Z]+|[0-9]{3}){0}
      
      (?<ip4addr>[0-9]{1,3}(\.[0-9]{1,3}){3}){0}
      (?<ip6addr>\h{1,4}(:\h{1,4}){7}){0}
      (?<shorthost>[a-zA-Z0-9][a-zA-Z0-9/-]+){0}
      #(?<host>\g<shorthost>(\.\g<shorthost>?)+){0}
      (?<host>\g<nonwhite>+){0}
      
      (?<hostname>\g<ip4addr>|\g<ip6addr>|\g<host>){0}
      
      (?<channel>[#&]\g<chanchar>+){0}
      (?<user>\g<userchar>+){0}
      (?<nick>\g<nonwhite>+){0}
      (?<mask>[\#$]\g<chanchar>+){0}
      (?<to>\g<channel>|(\g<user>@\g<hostname>)|\g<nick>|\g<mask>){0}
      (?<target>\g<to>(,\g<to>)*){0}
      
      (?<prefix>\g<host>|(\g<nick>(!\g<user>(@\g<host>)?)?)){0}
    }
     
    MESSAGE_FORMAT = /
      #{FORMAT}

      (?<message>^(:\g<prefix>\ )?\g<command>\ \g<params>\g<crlf>$)
    /x

    USER_FORMAT = /
      #{FORMAT}

      \g<nick>\!\g<user>\@\g<host>
    /x
  end
end