How to Parse Twitter XML in Ruby on Rails

Using the Hpricot Parser

If you use the Twitter API to query a timeline, you get back what looks like a whole mess of XML. Look closely, though, and the XML is well structured and hierarchical. The outermost element is the statuses element, which is marked as an array. It holds a list of status elements, each of which describes an individual status update. Inside each status element are various other elements, such as text and created_at. Nested one level deeper inside the status element is the user element, which describes the user the update belongs to.

It sometimes helps to view the XML with everything you don't need stripped out. Below is just such a view: the only remaining elements are the ones we're interested in. This should give you a much better idea of the structure of the document we're looking at.

<?xml version="1.0" encoding="UTF-8"?>
<statuses type="array">
  <status>
    <created_at>Thu Feb 26 03:11:50 +0000 2009</created_at>
    <text>Posted from CURL!</text>
    <user>
      <name>About Ruby</name>
    </user>
  </status>

  <status>
    <created_at>Thu Feb 26 03:07:42 +0000 2009</created_at>
    <text>A test post for the XML example</text>
    <user>
      <name>About Ruby</name>
    </user>
  </status>
</statuses>

Hpricot

There are many ways to parse XML data with Ruby, but I prefer Hpricot, a non-validating parser. Though it's intended for HTML, it works just fine for these purposes. Another parser, Nokogiri, is harder to install because it has more dependencies. Ruby also comes with the REXML library, and there are a few other libraries and bindings to native libraries you can use.

The basic concept of the following code is to iterate over all status elements and, for each one, print the text and the name of the person who said it. This uses the CSS-style selectors that Hpricot supplies. The 'status' selector gets all status elements, while the 'user name' selector returns all name elements nested inside user elements.

This parses data saved in a file called timeline.xml, which was obtained with the following command. Parsing data directly from the server will work in a similar way.

$ curl -u aboutruby:pass123 http://twitter.com/statuses/friends_timeline.xml >timeline.xml

#!/usr/bin/env ruby
require 'rubygems'
require 'hpricot'

doc = Hpricot(open('timeline.xml'))

(doc/'status').each do |st|
  user = (st/'user name').inner_html
  text = (st/'text').inner_html

  puts "#{user} said #{text}"
end
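
For comparison, the same loop can be sketched with REXML, the library that comes with Ruby. This is only an illustration, not a replacement for the Hpricot version. REXML uses XPath expressions rather than CSS-style selectors, so the paths '//status', 'user/name' and 'text' are a translation of the selectors used above, and the method name timeline_lines is made up for this example. The XML is the trimmed sample from earlier, embedded as a string so the sketch does not depend on timeline.xml.

```ruby
require 'rexml/document'

# The trimmed timeline from earlier, embedded so the sketch is self-contained.
TIMELINE_XML = <<~XML
  <?xml version="1.0" encoding="UTF-8"?>
  <statuses type="array">
    <status>
      <created_at>Thu Feb 26 03:11:50 +0000 2009</created_at>
      <text>Posted from CURL!</text>
      <user>
        <name>About Ruby</name>
      </user>
    </status>
    <status>
      <created_at>Thu Feb 26 03:07:42 +0000 2009</created_at>
      <text>A test post for the XML example</text>
      <user>
        <name>About Ruby</name>
      </user>
    </status>
  </statuses>
XML

# Hypothetical helper: returns one "NAME said TEXT" string per status element.
def timeline_lines(xml)
  doc = REXML::Document.new(xml)

  # '//status' matches every status element in the document.
  doc.get_elements('//status').map do |st|
    user = st.elements['user/name'].text  # name element inside user
    text = st.elements['text'].text       # text element of this status
    "#{user} said #{text}"
  end
end

puts timeline_lines(TIMELINE_XML)
```

The block passed to map does the same work as the body of the Hpricot loop; only the lookup syntax differs.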

Making HTTP Queries From Ruby

The previous example was good, but it read from an XML document stored on the hard drive. That XML was obtained using curl, which is not the ideal way of going about things. It's better to query Twitter for the information directly from your Ruby program.

Ruby comes with an HTTP library we can use called Net::HTTP. To start using this library, require net/http.

The first step is to make a TCP connection to the Twitter servers. This is done with the Net::HTTP.start method. Pass just 'twitter.com' to it, as it only needs to know the hostname. It returns an object that allows us to communicate with Twitter, and we'll store it in a variable called twitter. We'll use this variable to make a GET request. Note that a connection opened this way stays open until its finish method is called, so don't keep the object around longer than you need it. The start method also accepts a block, much like File.open, and closes the connection automatically when the block ends. The block form is not used here; instead, the connection object stays local to the twitter method.
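
The block form of start mentioned above might look like the following minimal sketch. The helper name fetch_body and its port parameter are hypothetical and exist only for illustration; Net::HTTP.start and get are the standard library calls. Authentication is left out to keep the focus on the connection itself.

```ruby
require 'net/http'

# Hypothetical helper (not part of this article's code): fetch a path
# and return the response body as a string.
def fetch_body(host, path, port = 80)
  # Given a block, Net::HTTP.start closes the connection when the block
  # exits and returns the value of the block.
  Net::HTTP.start(host, port) do |http|
    http.get(path).body
  end
end
```

A call such as fetch_body('twitter.com', '/statuses/friends_timeline.xml') would request the same path as the curl command, without the credentials.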

Next, you need to set up the request object. This will be a GET request, so we'll use the Net::HTTP::Get class. It needs to be passed the path it should access (minus the hostname), and it needs to know the username and password for authentication. Alternatively, the Net::HTTP::Post class can be used to create a POST request.
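
As a rough sketch of this step, request objects can be built and inspected without opening a connection at all. The credentials are the placeholders used elsewhere in this article. The POST path and the status field are assumptions made for illustration and do not come from this article's code.

```ruby
require 'net/http'

# GET request: pass only the path, without the hostname.
get = Net::HTTP::Get.new('/statuses/friends_timeline.xml')
get.basic_auth('aboutruby', 'pass123')

# POST request: the parameters travel in the body as form data.
# NOTE: the path and the 'status' field are assumed for illustration.
post = Net::HTTP::Post.new('/statuses/update.xml')
post.basic_auth('aboutruby', 'pass123')
post.set_form_data('status' => 'Posted from Ruby!')

puts get.method            # HTTP method of the GET object
puts get.path              # path the request will ask for
puts get['Authorization']  # header added by basic_auth
puts post.method           # HTTP method of the POST object
puts post.body             # URL-encoded form data set by set_form_data
```

Nothing is sent until one of these objects is handed to the request method of an open connection.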

All of this has been encapsulated in a reusable method called twitter. Since it was designed to be reusable, it's a bit longer and does a bit of error checking. It's commented though, so it shouldn't be difficult to follow.

Finally, the request is made, the response is read, its body is extracted and an Hpricot document is created from it. The code from then on is the same as the previous example.

#!/usr/bin/env ruby
require 'rubygems'
require 'hpricot'
require 'net/http'
require 'uri'

$username = 'aboutruby'
$password = 'pass123'

def twitter(command, opts={}, type=:get)
  # Open an HTTP connection to twitter.com
  twitter = Net::HTTP.start('twitter.com')

  # Depending on the request type, create either
  # an HTTP::Get or HTTP::Post object
  case type
  when :get
    # Append the URL-encoded options to a copy of the path,
    # leaving the caller's string untouched
    path = opts.empty? ? command : "#{command}?#{URI.encode_www_form(opts)}"
    req = Net::HTTP::Get.new(path)

  when :post
    # Set the form data with options
    req = Net::HTTP::Post.new(command)
    req.set_form_data(opts)

  else
    raise ArgumentError, "Unsupported request type: #{type}"
  end

  # Set up the authentication and
  # make the request
  req.basic_auth( $username, $password )
  res = twitter.request(req)

  # Raise an exception unless Twitter
  # returned an OK result
  unless res.is_a? Net::HTTPOK
    doc = Hpricot(res.body)
    raise "#{(doc/'request').inner_html}: #{(doc/'error').inner_html}"
  end

  # Return the response body as an Hpricot document
  return Hpricot(res.body)
ensure
  # Close the connection opened above
  twitter.finish if twitter && twitter.started?
end

doc = twitter('/statuses/friends_timeline.xml')

(doc/'status').each do |st|
  user = (st/'user name').inner_html
  text = (st/'text').inner_html

  puts "#{user} said #{text}"
end
©2014 About.com. All rights reserved.