Many of you will be familiar with Hpricot, the HTML parser for Ruby. Hpricot, created by whytheluckystiff, is written in C, which makes it considerably faster than pure-Ruby parsers. For a while, Hpricot has been the standard choice if you wanted a fast and easy HTML parser in Ruby.
Enter Nokogiri. Like Hpricot, Nokogiri is a parser backed by native C code, but it does much more. It parses XML as well as HTML, and it accepts both CSS and XPath selectors, which helps because CSS is more familiar to those who don't do much work with XML. And the icing on the cake: according to the benchmarks, it's faster than Hpricot.
Like any other gem, Nokogiri is installed with the gem command. It uses native extensions, so on Linux or OS X you'll need a build environment (a C compiler and related tools). On Windows, be sure to select the precompiled x86-mswin32-60 package.
C:\Documents and Settings\Username>gem install nokogiri
Bulk updating Gem source index for: http://gems.rubyforge.org
Select which gem to install for your platform (i386-mswin32)
 1. nokogiri 1.0.6 (ruby)
 2. nokogiri 1.0.6 (x86-mswin32-60)
 3. nokogiri 1.0.5 (x86-mswin32-60)
 4. nokogiri 1.0.5 (ruby)
 5. Skip this gem
 6. Cancel installation
>
Basic Usage of Nokogiri
First, you need something to parse. The easiest way to fetch a page over the Internet is the open-uri library. It overloads the Kernel#open method so that you can open a URL the same way you would open a local file. (On Ruby 3.0 and later, Kernel#open no longer accepts URLs, so call URI.open instead.)
Next, you have to create a Nokogiri document object. These examples parse HTML, so the HTML parser is used. You should use the XML parser for XML files. Once the document is created, it's ready to accept queries.
Here the CSS selector queries are used. The benchmarks suggest these are slower than the XPath queries, but since I'm much more familiar with CSS, that's what we'll use. The great thing is that you have a choice: use whichever you're comfortable with.
Google results are just a list of links. In short, what you want to do is find all a (anchor, or link) tags with the l class. In CSS, this is simply the query a.l.
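#!/usr/bin/env ruby
require 'rubygems'
require 'nokogiri'
require 'open-uri'

# Fetch the results page and parse it as HTML.
# URI.open is open-uri's entry point on Ruby 2.5 and later;
# older versions went through the overloaded Kernel#open.
doc = Nokogiri::HTML(URI.open('http://www.google.com/search?q=nokogiri'))

# Print the text of every result link (a tags with the l class).
doc.css('a.l').each do |l|
  puts l.content
end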
A Little More Advanced
Beyond just iterating over results, you can also store them and perform further queries on them. The css and xpath query methods return a Nokogiri::XML::NodeSet, which can be iterated over and can also take further queries.
In this example, the query gets the entire li element that encapsulates the result links and descriptions. These are iterated over as in the previous example, but instead of simply printing their contents, further queries are performed on them. The first gets the link text, and prints it. The second iterates over all the em tags in the list item and prints their content, but indented.
This example is actually quite useful. It's intended to be used with queries such as "* is faster than *". Google allows you to use wildcards in your searches, so practically anything can be returned by this query. The example program prints the text of each link, with the phrases that matched the wildcards just under it.
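#!/usr/bin/env ruby
require 'rubygems'
require 'nokogiri'
require 'open-uri'

# Fetch and parse the results for the wildcard query "* is faster than *".
# URI.open is open-uri's entry point on Ruby 2.5 and later;
# older versions went through the overloaded Kernel#open.
doc = Nokogiri::HTML(URI.open('http://www.google.com/search?q=%22*+is+faster+than+*%22'))

# Each li.g element wraps one result: its link and its description.
doc.css('li.g').each do |l|
  # Print the link text.
  puts l.css('a.l').first.content

  # Print each emphasized phrase, indented under the link text.
  l.css('em').each do |em|
    puts "  " + em.content
  end
end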