Iterating Over Strings

Even with the modern toolkit of regular expressions and methods like split, partition or scan, you may still want to iterate over strings manually. Perhaps the strings hold some encoded data, or you simply need to count certain characters. It's not a common task, but sooner or later you'll need to iterate over a string.

Character Encoding

Historically (read: in C), iterating over strings was easy. A character was one byte, and a string was just an array of characters. You walked the bytes with array indexing or a pointer, and it was no different from any other for loop. But things are not so simple in Ruby.

Starting with Ruby 1.9, strings are encoding-aware, and in practice they are usually Unicode text encoded as UTF-8 (Ruby 2.0 went on to make UTF-8 the default source encoding). Unicode is a complex topic, but all you really need to know is that the characters of all the world's languages cannot fit into the 128 characters of plain (non-extended) ASCII, so a larger encoding scheme was needed. UTF-8 encodes every ASCII character in 1 byte, but all other characters take 2, 3 or even 4 bytes. Since characters aren't necessarily one byte long, you cannot simply iterate over the bytes of a string and expect to get coherent characters. Something more sophisticated is needed.

First, you'll need a Unicode string to play with. Accented characters fall outside ASCII, and the first word I could think of was "mêlée". It contains two accented characters. If you fire up 1.8.7 (don't have that anymore? Install it with RVM) and try to assign that to a string, you'll get the following.


1.8.7 :001 > str = "mêlée"
 => "m\303\252l\303\251e" 

Ruby 1.8.7 treats the string as a plain sequence of bytes, so each accented character shows up as two bytes escaped in octal. You can turn on UTF-8 handling in 1.8.7 by assigning "u" to $KCODE. If you do that, or fire up 1.9.x and try the same thing, you'll see the expected string.


1.9.3p0 :001 > str = "mêlée"
 => "mêlée" 
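
To make the byte-versus-character mismatch described above concrete, here is a minimal sketch, assuming a Ruby 1.9 or later interpreter and a UTF-8 source file. The variable names are only illustrative.

```ruby
# encoding: utf-8
word = "mêlée"

char_count = word.length    # counts characters
byte_count = word.bytesize  # counts the bytes of the UTF-8 encoding

puts "#{char_count} characters, #{byte_count} bytes"

# Each accented letter takes two bytes; a plain ASCII letter takes one.
accented_width = "ê".bytesize
ascii_width    = "m".bytesize
puts "ê: #{accented_width} bytes, m: #{ascii_width} byte"
```

The two counts differ by exactly the two extra bytes used by ê and é, which is why byte-by-byte iteration cannot be trusted to produce whole characters.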

Iterating with each_char

In 1.9.x and later, the preferred way to iterate over a string is to use the each_char method. It works just as you'd think: it takes a block and yields once for each character. It is encoding-aware, so it yields each character, not each byte.


"mêlée".each_char do |c|
  puts c   # m, ê, l, é, e (one per line)
end
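
As one possible use of each_char, the character-counting task mentioned at the start of this article could be sketched as follows. The helper name char_counts and the sample phrase are made up for illustration and are not part of Ruby itself.

```ruby
# encoding: utf-8

# Hypothetical helper: tally how often each character appears in a string.
def char_counts(str)
  counts = Hash.new(0)                   # missing keys start at 0
  str.each_char { |c| counts[c] += 1 }   # one block call per character
  counts
end

counts = char_counts("crème brûlée")
puts counts.inspect
```

Because each_char yields whole characters, the accented letters are counted as single entries and kept separate from the plain "e".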

Similarly, if you wish to iterate over the string's numerical values, you can use the each_codepoint method. This might sound a bit cryptic: what is a "codepoint"? Unicode assigns every character a unique number, called its code point, in the range 0 to 0x10FFFF. Related characters are grouped into contiguous ranges called blocks (for instance, the characters for the Thai language sit together in one block), but the code point alone identifies the character. A code point is independent of how the character is stored: ê is code point 234 (U+00EA) even though UTF-8 stores it as two bytes. By using each_codepoint you can iterate over the numeric values of the characters in a string without tripping over multi-byte characters.


"mêlée".each_codepoint do |i|
  puts i   # 109, 234, 108, 233, 101 (one per line)
end
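
Code points are conventionally written in hexadecimal with a U+ prefix. The following sketch, assuming a UTF-8 source file, prints them in that notation and turns each number back into its character with Array#pack. The variable names are illustrative only.

```ruby
# encoding: utf-8
word = "mêlée"

codepoints = word.each_codepoint.to_a
labels     = codepoints.map { |cp| format("U+%04X", cp) }

codepoints.each_with_index do |cp, i|
  char = [cp].pack("U")   # "U" packs an integer as a UTF-8 character
  puts "#{labels[i]}  #{char}"
end

# Packing all code points at once rebuilds the original string.
rebuilt = codepoints.pack("U*")
```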

The Most Primitive: each_byte

Maybe you don't care about Unicode? Maybe the string contains binary data that might not be valid UTF-8? In this case, what you really want is the each_byte method. It iterates over the string while ignoring its encoding entirely, so a multi-byte character is yielded as several separate bytes. This is the most primitive of the string iteration methods: whatever is in the C string behind the scenes (remember, the standard Ruby interpreter is written in C, and the String class is essentially a wrapper around a C character array), it yields each and every byte as an integer. It is best used when the string's contents are not text.


"mêlée".each_byte do |b|
  puts b   # 109, 195, 170, 108, 195, 169, 101 (one per line)
end
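
A typical use of each_byte is inspecting data that may not be text at all. The sketch below defines a hypothetical hex_dump helper (not a built-in method) and applies it both to the sample word and to a deliberately invalid UTF-8 byte sequence.

```ruby
# encoding: utf-8

# Hypothetical helper: render every byte of a string as two hex digits.
def hex_dump(str)
  str.each_byte.map { |b| format("%02x", b) }.join(" ")
end

text_dump = hex_dump("mêlée")
puts text_dump

# Build a string from raw bytes; 0xFF can never appear in valid UTF-8.
raw = [0xFF, 0x00, 0x6F, 0x6B].pack("C*").force_encoding("UTF-8")
raw_is_valid = raw.valid_encoding?
raw_dump = hex_dump(raw)
puts "valid UTF-8? #{raw_is_valid}, bytes: #{raw_dump}"
```

Since each_byte never tries to decode characters, it works on the invalid string just as it does on the valid one.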

Iterating Over Lines

Strings can contain multiple lines: somewhere in the middle of the string there may be one or more newline characters. If you wish to iterate over the lines, use the each_line method, which yields each line it finds in the string, including its trailing newline. Use this if you have just read the entire contents of a file, or a large chunk of text from the network. Note that each_line can also be called directly on a File object, which is more memory-efficient for large files since the entire contents don't have to be read into memory at once.


File.read('something.txt').each_line do |l|
  puts l
end
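
The more efficient variant, reading line by line straight from the file, might look like the sketch below. It writes a small temporary file first so that it can run on its own; the file contents and variable names are made up for the example.

```ruby
require "tempfile"

# Create a small throwaway file so the example is self-contained.
file = Tempfile.new("lines")
file.write("first\nsecond\nthird\n")
file.close

lines = []
# File.foreach yields one line at a time without loading the whole file.
File.foreach(file.path) do |line|
  lines << line.chomp   # chomp strips the trailing newline
  puts "#{lines.size}: #{line.chomp}"
end

file.unlink   # remove the temporary file
```

Calling each_line on a file opened with File.open behaves the same way; File.foreach simply opens and closes the file for you.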

Splitting and Iterating

A trick that was used quite often in the 1.8.x tree was splitting a string into an array of one-character strings and iterating over that. That's not terribly useful in 1.9.x, where each_char does the job, but it was one of the only ways of reliably iterating over strings in 1.8.x.


"a string".split(//).each do |c|
  puts c
end
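
On 1.9.x and later, the same array of one-character strings is available from the chars method, so the split trick is rarely needed. A brief comparison, assuming a UTF-8 source file:

```ruby
# encoding: utf-8
text = "mêlée"

split_chars  = text.split(//)    # the 1.8.x-era trick
listed_chars = text.chars.to_a   # to_a covers 1.9, where chars returns an enumerator

puts split_chars.inspect
puts listed_chars.inspect
same_result = (split_chars == listed_chars)
puts same_result
```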

©2014 About.com. All rights reserved.