The Mechanize 2.0 Handbook

Mechanize is a library for automated interaction with web sites. For all intents and purposes, it acts like a web browser with no user interface. It downloads web pages, can click on links, fill out and submit forms, store cookies, etc. Mechanize is useful for automated crawling, testing and scraping of web sites.

However, Mechanize does have one downside: it doesn't support JavaScript. Mechanize is a rather small piece of software, more or less wrapping an HTTP library and Nokogiri. Because of this, any JavaScript on the downloaded pages will not execute, the DOM tree will not be updated, and any links that need JavaScript to appear or function will not work with Mechanize.

  • The Mechanize Agent - A Mechanize agent object is where everything starts. Normally, your web browser is your "user agent," something that acts on your behalf to fetch web pages. Mechanize is your Ruby program's agent. After creating and configuring this object, you can fetch your first Page. This object also keeps working in the background, though you won't often interact with it directly after fetching your first Page. Here you'll take care of things like setting your user agent string, configuring proxies, etc.

  • The Page - The Page represents a page that's been fetched from a web site. From the page object, you can interact with links, fill out and submit forms, or use the Nokogiri parser to scrape information from the page. For many tasks, you'll mostly be interacting with Page objects.

  • The Link - The gateway to other pages. The typical workflow is to use an agent to fetch a page, then navigate to more pages using links. While this is a small class under Page's namespace, you'll be using Link objects quite often, and it's useful to know what they can do.

  • Forms - Other than links, about the only way to get to another page or submit any data to the site is with a form. The form object is quite capable, able to submit all types of fields that a user with a web browser would be able to. This includes the ability to upload files and manipulate all types of form fields, such as drop-down boxes, check boxes, text fields, etc.

  • Scraping with Nokogiri - Mechanize is powered by Nokogiri, and exposes a Nokogiri interface to you for scraping needs. You don't need to rely on the rather narrow interface Mechanize provides for following links and filling out forms. You can extract any information you'd like from the page using the very capable Nokogiri library.

  • Cookies - Mechanize acts like a stateful browser, and as such stores cookies in a cookie jar. Mechanize also exposes an interface for interacting with cookies, both for working with individual cookies and for importing and exporting entire cookie jars. This is most useful for testing.

  • User Agent Aliases - Mechanize is able to disguise itself as various web browsers by mimicking their user agent strings. These strings are difficult to remember, so Mechanize provides a number of aliases for common browsers.

  • History - Mechanize will keep a history of all URLs it has accessed. Much like your web browser history, this history can be accessed via the Mechanize object itself.

  • Pluggable Parsers - While Mechanize is primarily designed to parse HTML and XHTML pages, it also provides the ability to parse pages of any other format. Much like Mechanize's own HTML parser class (the Page class), you can provide your own classes for parsing any file format.

  • Response Headers - Response headers often hold small pieces of useful information. Other than the cookies, headers can tell you how long a page should be cached, whether authentication is needed, etc. Mechanize provides an interface to access the response headers for a page.

  • Authentication - Form-based authentication that relies on an authorization cookie is straightforward to handle, but HTTP authentication is a bit trickier. Thankfully, Mechanize provides an interface for handling it.

  • Dealing With Errors - Not everything goes right all of the time. When something goes wrong, Mechanize will raise an exception, which you can catch and handle.

©2014 About.com. All rights reserved.