CSV Example: Parsing CSV

A rather common file format for tabular data is the CSV, or Comma-Separated Values, file. It's especially common as an interchange format: every spreadsheet program out there is bound to support CSV files. And for good reason, since they're dead simple. Each row is just a comma-separated list of values, and the first row is usually a comma-separated list of column names. They're also ancient, one of the first widely used file formats in the computing industry (back when it really was "computing").

In this example, we'll parse a CSV file. This CSV file holds the grades in five subjects for a group of students (the Corleone family). After parsing the file, we'll also do some manipulations on the results, then sort them and print them in a nicely formatted table. Note that this is just an example program. This CSV code has some limitations, and Ruby already comes with a CSV library. So don't hold onto this code too hard, or think that you need to write your own CSV parsing code every time you encounter a CSV file.

The Data

The following is the CSV file we'll be parsing. It's a simple file, with no surprises. The first line has the column names, and the remaining lines have the students and their grades. Note that the grades are all simple integers and the names don't contain any punctuation. The biggest problem in writing a CSV parser is dealing with commas embedded in row data, but we'll ignore that for this example.


Name,Art,History,Math,English,Science
Vito Corleone,87,85,72,65,70
Michael Corleone,68,82,90,70,96
Santino Corleone,93,80,81,87,89
Fredo Corleone,62,80,62,62,99
Tom Hagen,77,62,70,83,85
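
To make the embedded-comma problem concrete, here is a small sketch. The row below is hypothetical (it is not part of the grades file) and quotes the name field so that it can contain a comma. Splitting on commas miscounts the fields, while CSV.parse_line from Ruby's standard csv library handles the quoting.

```ruby
require 'csv'  # Ruby's standard CSV library

# Hypothetical row: the name contains a comma, so the field is quoted.
line = '"Corleone, Vito",87,85,72,65,70'

# Naive splitting cuts the quoted name in two.
naive = line.split(',')
puts naive.length    # 7 fields instead of 6

# The standard library understands the quoting rules.
parsed = CSV.parse_line(line)
puts parsed.length   # 6 fields
puts parsed.first    # Corleone, Vito
```

The grades file in this article has no quoted fields, which is why plain splitting is good enough for the example.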

The 'csv' Method

The csv method takes a filename and returns two values: an array of column names and an array of hashes. Each element of the second array represents one row of the CSV file, with the values paired with their column names and stored in a hash. The following demonstrates the usage of the csv method (we'll get to how it works below). Since the csv method has two return values, be sure to assign its result to two variables. If you assign it to just one variable, you'll get an array holding these two values.


columns, grades = csv('grades.csv')

puts grades[0]['Name']   # Vito Corleone
puts grades[1]['Name']   # Michael Corleone
puts grades[-1]['Name']  # Tom Hagen
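
The difference between assigning to two variables and assigning to one can be tried without any file at all. The sketch below uses a hypothetical stand-in method, two_values, that returns two values the same way csv does.

```ruby
# Hypothetical stand-in for csv: returns two values, just like the real method.
def two_values
  return ['Name', 'Art'], [{ 'Name' => 'Vito Corleone', 'Art' => '87' }]
end

# Two variables: each one receives one of the return values.
columns, grades = two_values
puts columns.inspect     # ["Name", "Art"]
puts grades[0]['Name']   # Vito Corleone

# One variable: it receives both values wrapped in a single array.
result = two_values
puts result.length       # 2
puts result[0] == columns  # true
```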

And now an examination of the csv method itself.


# Parse a CSV file with column names.
# The first line of the file is assumed
# to be the column names, and also defines
# the number of columns.  Returns the array
# of column names and an array of hashes,
# one hash per row.
def csv(file)
  File.open(file) do |f|
    columns = f.readline.chomp.split(',')

    table = []
    until f.eof?
      row = f.readline.chomp.split(',')
      row = columns.zip(row).flatten
      table << Hash[*row]
    end

    return columns, table
  end
end

The theory of operation is simple. Open the file and read the first line. Split it and store it as the column names. This is all done in one method call chain: f.readline.chomp.split(','). If you're not accustomed to this type of method call chain, you're going to have to get used to it, as it's a common Ruby idiom. Just read it left to right. Here, we take the file and call readline, which reads the next line. Then comes chomp, which removes the line ending from the end of the string, and finally split, which splits that string on commas. We're left with an array of the column names.
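
To watch the chain work one step at a time, here is a small sketch. It assumes a StringIO object standing in for the opened file, so it needs no file on disk, and the two-line data is a shortened, made-up version of the grades file.

```ruby
require 'stringio'

# StringIO stands in for the opened file in this sketch.
f = StringIO.new("Name,Art,History\nVito Corleone,87,85\n")

# The chain, taken apart into its three steps.
line    = f.readline          # "Name,Art,History\n"
trimmed = line.chomp          # "Name,Art,History"
columns = trimmed.split(',')  # ["Name", "Art", "History"]

# The same three steps as one chain, applied to the next line.
row = f.readline.chomp.split(',')

puts columns.inspect  # ["Name", "Art", "History"]
puts row.inspect      # ["Vito Corleone", "87", "85"]
puts f.eof?           # true, nothing is left to read
```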

Now for the table itself. We declare an empty table array and, as long as there are still lines to be read, read a new line and process it. The first step in processing the line is the same one we performed on the header: readline, chomp, split on commas. The next two lines are a bit of trickery that saves a lot of work in assembling the hash.

The next line is columns.zip(row).flatten. What are we trying to do, and how are we doing it? We're setting ourselves up to use the Hash.[] class method for constructing hashes. It's a shortcut method for building hashes from array data. Given a list of arguments, it takes them as alternating keys and values. For example, calling Hash[:a, 1, :b, 2] is equivalent to the hash literal { :a => 1, :b => 2 }. This is not a common way to construct a hash, but it is convenient when the data is already in an array. Also note that the splat operator is applied to the array. Hash.[] won't accept alternating keys and values packed into a single flat array; it wants them as an argument list, so the array must be splatted first.
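
A short sketch of both forms follows. The sample values are taken from the grades file, and the variable names are only for illustration.

```ruby
# Alternating keys and values passed as separate arguments.
a = Hash[:a, 1, :b, 2]
puts a == { :a => 1, :b => 2 }   # true

# A flat array of alternating keys and values is splatted
# into an argument list before Hash.[] sees it.
flat = ['Name', 'Vito Corleone', 'Art', '87']
b = Hash[*flat]
puts b['Name']   # Vito Corleone
puts b['Art']    # 87
```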

So, back to processing the line and getting it ready to make the hash. The statement columns.zip(row).flatten is "zipping" together the column names with the row values. When two arrays are "zipped", they're formed into sub-arrays with corresponding values together. If you were to call [:a, :b].zip([1, 2]), you would get the array [ [:a, 1], [:b, 2] ]. Next, we flatten this into [:a, 1, :b, 2] so it'll be ready to be made into a hash. Once the hash is constructed, it's appended to the table using the << operator.
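
Here are those steps run on one row, as a sketch with a shortened, made-up column list. The last line shows Array#to_h, an alternative that is not used in this article's code but builds the same hash directly from the zipped pairs.

```ruby
columns = ['Name', 'Art', 'History']
row     = ['Vito Corleone', '87', '85']

# Pair each column name with the value in the same position.
pairs = columns.zip(row)
# [["Name", "Vito Corleone"], ["Art", "87"], ["History", "85"]]

# Flatten the pairs into one list of alternating keys and values.
flat = pairs.flatten

# Build the hash and append it to the table.
record = Hash[*flat]
table = []
table << record
puts table[0]['History']    # 85

# Alternative (not from the article): build the hash straight from the pairs.
puts pairs.to_h == record   # true
```

Note that the values stay strings ('87', not 87), since nothing here converts them to numbers.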

Wow, that's a lot of work done in (mostly) three lines of code. This is one of Ruby's best advantages: it's expressive without being terse or cryptic. With just a bit of learning, these method call chains can be read easily. The main barrier is the sheer number of methods Ruby offers, which takes some time to learn.

©2014 About.com. All rights reserved.