Ruby `split': invalid byte sequence in UTF-8 (ArgumentError)

Question

I am trying to populate the movie object, but when parsing through the u.item file I get this error:

`split': invalid byte sequence in UTF-8 (ArgumentError)

File.open("Data/u.item", "r") do |infile|
  while line = infile.gets
    line = line.split("|")
  end
end

The error occurs only on lines that contain accented or other non-ASCII characters.

Here's a sample line:

543|Misérables, Les (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Mis%E9rables%2C%20Les%20%281995%29|0|0|0|0|0|0|0|0|1|0|0|0|1|0|0|0|0|0|0

Is there a workaround?

It works for me with the corpus as posted. @IgnacioVazquez-Abrams is probably right: you need to use a hex editor to see if you have hidden characters in your data file. — Todd A. Jacobs, Jun 16 '12 at 18:57
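A hex editor is one option, but the same check can be done from Ruby. The sketch below is only an illustration: it assumes the file stores "é" as the single ISO-8859-1 byte 0xE9 (consistent with the `%E9` in the sample URL, but not confirmed by the poster) and builds that line by hand instead of reading `u.item`. The names `raw`, `line`, and `suspects` are made up for the example.

```ruby
# Hypothetical stand-in for one line of u.item, with "é" as the raw byte 0xE9.
raw = "543|Mis\xE9rables, Les (1995)|01-Jan-1995".b

# Label the bytes as UTF-8, which is what happens when the file is read
# with a UTF-8 default external encoding.
line = raw.dup.force_encoding("UTF-8")

# In UTF-8, 0xE9 must be followed by two continuation bytes; here "r" follows,
# so the string is not valid UTF-8.
valid = line.valid_encoding?

# Collect [offset, hex] for every non-ASCII byte so the culprit is visible.
suspects = line.bytes.each_with_index
               .select { |byte, _| byte > 0x7F }
               .map { |byte, offset| [offset, format("0x%02X", byte)] }

puts "valid UTF-8? #{valid}"
suspects.each { |offset, hex| puts "byte #{hex} at offset #{offset}" }
```

If `valid_encoding?` returns false for a line from the real file, the bytes are most likely in some other encoding, and the offsets show where to look.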

score 21 · Answer 1 · answered Jun 17 '12 at 14:07

I had to force the encoding of each line to ISO-8859-1 (Latin-1, the single-byte Western European character set): http://en.wikipedia.org/wiki/ISO/IEC_8859-1

# Tag each line as ISO-8859-1, then split it on the pipe delimiter.
m = []
IO.foreach("u.item") { |line| m << line.force_encoding("iso-8859-1").split("|") }

kashive

7

You can specify what encoding Ruby should use when using `open`, e.g. `File.open 'data.txt', 'r:iso-8859-1' do ...`. See [the docs](http://ruby-doc.org/core-1.9.3/IO.html#method-c-new). – matt Jun 17 '12 at 16:46
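A minimal sketch of that suggestion, assuming the data really is ISO-8859-1. To keep it self-contained it writes a one-line temporary file in place of `Data/u.item`; the `sample` and `movies` names are illustrative. The mode string `"r:iso-8859-1:utf-8"` gives the file's external encoding followed by the internal encoding Ruby should transcode to, so the strings that reach `split` are valid UTF-8.

```ruby
require "tempfile"

# Stand-in for Data/u.item: one record written as raw ISO-8859-1 bytes.
sample = Tempfile.new("u.item")
sample.binmode
sample.write("543|Mis\xE9rables, Les (1995)|01-Jan-1995\n".b)
sample.close

movies = []
# external encoding (what is on disk) : internal encoding (what the program sees)
File.open(sample.path, "r:iso-8859-1:utf-8") do |infile|
  infile.each_line do |line|
    movies << line.chomp.split("|")
  end
end

puts movies.first[1]
sample.unlink
```

Dropping the `:utf-8` part (`"r:iso-8859-1"`) also avoids the error, but leaves the strings tagged as ISO-8859-1, which can matter later when they are compared or concatenated with UTF-8 strings.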

Todd A. Jacobs · Answer 2 · 2012-06-16T19:12:32.667

15

Ruby is somewhat sensitive to character encoding issues. You can do a number of things that might solve your problem. For example:

Put an encoding comment at the top of your source file.
```
# encoding: utf-8
```
Explicitly encode your line before splitting.
```
line = line.encode('UTF-8').split("|")
```
Replace invalid characters, instead of raising an Encoding::InvalidByteSequenceError exception.
```
line.encode('UTF-8', :invalid => :replace).split("|")
```

Give these suggestions a shot, and update your question if none of them work for you. Hope it helps!
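As the first comment below points out, the string is already labelled UTF-8, so re-encoding to UTF-8 without naming a source encoding may leave the bad byte in place. Two variations that sidestep this are sketched here. Both are assumptions layered on top of the answer: the first guesses that the real source encoding is ISO-8859-1, and the second uses `String#scrub`, which exists only in Ruby 2.1 and later. The sample line is built by hand for illustration.

```ruby
# Bytes as they would arrive from an ISO-8859-1 file read with a UTF-8 label.
line = "543|Mis\xE9rables, Les (1995)|01-Jan-1995".dup.force_encoding("UTF-8")

# Option A: transcode while naming the source encoding explicitly
# (assumed to be ISO-8859-1), so 0xE9 becomes a proper UTF-8 "é".
converted = line.encode("UTF-8", "ISO-8859-1")
fields = converted.split("|")

# Option B: keep the UTF-8 label and replace undecodable bytes with "?".
# This is lossy: the accented character is gone afterwards.
scrubbed = line.scrub("?")
lossy_fields = scrubbed.split("|")

puts fields[1]
puts lossy_fields[1]
```

Option A preserves the title; option B only makes sense when the original characters do not matter or the real encoding is unknown.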

edited Jun 16 '12 at 19:12

answered Jun 16 '12 at 18:42

Todd A. Jacobs

1

The error he's getting implies the encoding already is UTF-8. – Andrew Marshall Jun 16 '12 at 19:34
I inspected each line before the program tries to split it. The error occurs on lines with accented characters. Here is the record where it occurred: 543|Misérables, Les (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Mis%E9rables%2C%20Les%20%281995%29|0|0|0|0|0|0|0|0|1|0|0|0|1|0|0|0|0|0|0 I tried the third option as well, but it didn't work. Any ideas or alternative approaches? – kashive Jun 16 '12 at 20:09
1

This seems to address your edge case: http://stackoverflow.com/a/10466273/1301972 – Todd A. Jacobs Jun 16 '12 at 21:23
1

Found a working solution from this question: http://stackoverflow.com/questions/7047944/ruby-read-csv-file-as-utf-8-and-or-convert-ascii-8bit-encoding-to-utf-8/7048129#7048129 – Shadoath Nov 30 '15 at 20:52
