Searching Arbitrary Tags, AKA "Scraping"

Thus far, only methods for searching for specific elements and tags have been discussed. These are part of Mechanize's abstracted view of an HTML document. However, there will be times when you want to pull the text out of an arbitrary element, a practice otherwise known as "scraping."

Mechanize is powered by Nokogiri. Almost everything that happens behind the scenes, except for the actual fetching of the pages, is done by Nokogiri, so you could say that Mechanize itself is just a bit of glue. Nokogiri is a very powerful HTML and XML parser in its own right, and Mechanize exposes the Nokogiri parser object to you should you choose to use it. To search for a specific element using Nokogiri, use the Page#search method.

Nokogiri and the technologies it encompasses are too large a subject to explain here, but the following should be enough to get you started. If you wish to know more, read the Nokogiri documentation. In particular, if you want to be proficient in scraping, learn XPath.

The following example opens http://www.reddit.com/ and checks whether you are logged in. It does this by finding the span with the class user inside the div with the id header and examining its text. The "text" of an element is all of its inner text, without any of the HTML tags. If this text contains the string "want to join", you are not logged in, and the program exits.


require 'mechanize'

agent = Mechanize.new
page = agent.get('http://www.reddit.com/')

# Inner text of the span with class "user" inside the div with id "header"
text = page.search("//div[@id='header']//span[@class='user']").text
if text.include?('want to join')
  puts "You are not logged in"
  exit
end

Though it owes most of its power to technologies like XPath and their implementation in Nokogiri, scraping is perhaps the most difficult (or at least time-consuming) and most powerful feature of Mechanize. Being able to surgically extract the exact portion of a page you need is, for many, the entire reason for using Mechanize.

©2014 About.com. All rights reserved.