22

I am trying to populate the movie object, but when parsing through the u.item file I get this error:

`split': invalid byte sequence in UTF-8 (ArgumentError)

File.open("Data/u.item", "r") do |infile|
  while line = infile.gets
    line = line.split("|")
  end
end

The error occurs only on lines that contain non-ASCII characters, such as accented letters.

Here's a sample

543|Misérables, Les (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Mis%E9rables%2C%20Les%20%281995%29|0|0|0|0|0|0|0|0|1|0|0|0|1|0|0|0|0|0|0

Is there a workaround?

kashive

2 Answers

21

I had to force the encoding of each line to ISO-8859-1 (Latin-1, a single-byte Western European character set): http://en.wikipedia.org/wiki/ISO/IEC_8859-1

# Read each line, tag its bytes as ISO-8859-1, then split on the pipe delimiter.
m = []
IO.foreach("u.item") do |line|
  m << line.force_encoding("iso-8859-1").split("|")
end
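
A variation on the same idea is to declare the encoding once when opening the file and let Ruby transcode to UTF-8 while reading. This is only a sketch: it assumes the file really is ISO-8859-1, and `read_pipe_rows` is a hypothetical helper name, not something from the original code.

```ruby
require "tempfile"

# Hypothetical helper: read a pipe-delimited Latin-1 file and
# return an array of rows, each an array of UTF-8 strings.
def read_pipe_rows(path)
  rows = []
  # "r:iso-8859-1:utf-8" = read mode, external encoding ISO-8859-1,
  # internal encoding UTF-8 (Ruby transcodes each line as it is read).
  File.open(path, "r:iso-8859-1:utf-8") do |infile|
    infile.each_line do |line|
      rows << line.chomp.split("|")
    end
  end
  rows
end

# Demo on a temporary file holding one Latin-1 record (byte 0xE9 is "é").
Tempfile.create("u_item_demo") do |f|
  f.binmode
  f.write("543|Mis\xE9rables, Les (1995)|01-Jan-1995\n".b)
  f.flush
  puts read_pipe_rows(f.path).first[1]
end
```

Because the strings come back as valid UTF-8, later calls such as `split` or regular-expression matching should not hit the invalid byte sequence error.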
kashive
  • 7
    You can specify what encoding Ruby should use when using `open`, e.g. `File.open 'data.txt', 'r:iso-8859-1' do ...`. See [the docs](http://ruby-doc.org/core-1.9.3/IO.html#method-c-new). – matt Jun 17 '12 at 16:46
15

Ruby (1.9 and later) attaches an encoding to every string and raises an error when a string contains bytes that are invalid in that encoding. There are a number of things you can try that might solve your problem. For example:

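  1. Put an encoding comment at the top of your source file. Note that this sets the encoding of string literals in the source file, not of data read from other files.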

    # encoding: utf-8
    
  2. Explicitly encode your line before splitting.

    line = line.encode('UTF-8').split("|")
    
  3. Replace invalid characters, instead of raising an Encoding::InvalidByteSequenceError exception.

    line.encode('UTF-8', :invalid => :replace).split("|")
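
The asker's follow-up comment reports that option 3 did not help, which is consistent with the line already being tagged as UTF-8, so there may be nothing for `encode` to convert. A hedged sketch of two alternatives follows: naming the source encoding explicitly (assumed here to be ISO-8859-1), or, on Ruby 2.1 and later, using `String#scrub` to replace the invalid bytes. The `raw` string is a stand-in built for this example.

```ruby
# A raw line as it might come from the file: the bytes are Latin-1,
# but the string is tagged UTF-8, so byte 0xE9 is an invalid sequence.
raw = "543|Mis\xE9rables, Les (1995)|01-Jan-1995".b.force_encoding("UTF-8")

# Option A: transcode, naming the source encoding explicitly
# (assumption: the data is ISO-8859-1).
fields = raw.encode("UTF-8", "ISO-8859-1").split("|")

# Option B (Ruby 2.1+): keep UTF-8 and replace each invalid byte with "?".
scrubbed = raw.scrub("?").split("|")

puts fields[1]
puts scrubbed[1]
```

Option A preserves the accented character, while option B discards it, so A is preferable when the real encoding of the file is known.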
    

Give these suggestions a shot, and update your question if none of them work for you. Hope it helps!

Todd A. Jacobs
  • 1
    The error he's getting implies the encoding already is UTF-8. – Andrew Marshall Jun 16 '12 at 19:34
  • So, I inspected each line before the program tries to split it. It turns out that the error occurs in lines with fancy punctuation. Here is the record where the error occurred: 543|Misérables, Les (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Mis%E9rables%2C%20Les%20%281995%29|0|0|0|0|0|0|0|0|1|0|0|0|1|0|0|0|0|0|0 I tried the third option as well, but it didn't work. Any ideas, or alternative approaches? – kashive Jun 16 '12 at 20:09
  • 1
    This seems to address your edge case: http://stackoverflow.com/a/10466273/1301972 – Todd A. Jacobs Jun 16 '12 at 21:23
  • 1
    Found a working solution from this question: http://stackoverflow.com/questions/7047944/ruby-read-csv-file-as-utf-8-and-or-convert-ascii-8bit-encoding-to-utf-8/7048129#7048129 – Shadoath Nov 30 '15 at 20:52