
I'm using Ruby 1.9.2.

I'm trying to parse a CSV file that contains some French words (e.g. spécifié) and place the contents in a MySQL database.

When I read the lines from the CSV file,

file_contents = CSV.read("csvfile.csv", col_sep: "$")

The elements come back as Strings that are ASCII-8BIT encoded (spécifié becomes sp\xE9cifi\xE9), and strings like "spécifié" are then NOT properly saved into my MySQL database.

Yehuda Katz says that ASCII-8BIT is really "binary" data meaning that CSV has no idea how to read the appropriate encoding.

So, if I try to make CSV force the encoding like this:

file_contents = CSV.read("csvfile.csv", col_sep: "$", encoding: "UTF-8")

I get the following error

ArgumentError: invalid byte sequence in UTF-8: 

If I go back to my original ASCII-8BIT encoded Strings and examine the String that my CSV read as ASCII-8BIT, it looks like this "Non sp\xE9cifi\xE9" instead of "Non spécifié".

I can't convert "Non sp\xE9cifi\xE9" to "Non spécifié" by doing this "Non sp\xE9cifi\xE9".encode("UTF-8")

because I get this error:

Encoding::UndefinedConversionError: "\xE9" from ASCII-8BIT to UTF-8,

which Katz indicated would happen because ASCII-8BIT isn't really a proper String "encoding".

Questions:

  1. Can I get CSV to read my file in the appropriate encoding? If so, how?
  2. How do I convert an ASCII-8BIT string to UTF-8 for proper storage in MySQL?
jpemberthy
user141146
  • It sounds like the file might not be UTF-8 encoded; have you checked the actual encoding of the file? – coreyward Aug 13 '11 at 01:34
  • Your file is not encoded in UTF-8. é in UTF-8 should be `C3 A9`, not `E9`. Looks like you're dealing with ISO-8859-1 instead. – deceze Aug 13 '11 at 01:34
  • I think I figured it out: my_ascii_8bit_string.unpack("C*").pack("U*") seems to work. – user141146 Aug 13 '11 at 01:34
  • @deceze: Yes, the file isn't UTF-8 encoded, but I wanted a way to do it via ruby – user141146 Aug 13 '11 at 01:35
  • Then the correct way would be to read the CSV as ISO-8859-1 and convert the result from ISO-8859-1 to UTF-8 using encoding conversion functions. Unfortunately my Ruby isn't good enough to tell you how to do that. – deceze Aug 13 '11 at 01:37

3 Answers


deceze is right: that is ISO-8859-1 (a.k.a. Latin-1) encoded text. Try this:

file_contents = CSV.read("csvfile.csv", col_sep: "$", encoding: "ISO8859-1")

If that doesn't work, you can use Iconv (deprecated in Ruby 1.9.3 and removed from the standard library in 2.0) to fix up the individual strings with something like this:

require 'iconv'
utf8_string = Iconv.iconv('utf-8', 'iso8859-1', latin1_string).first

If latin1_string is "Non sp\xE9cifi\xE9", then utf8_string will be "Non spécifié". Also, Iconv.iconv can unmangle whole arrays at a time:

utf8_strings = Iconv.iconv('utf-8', 'iso8859-1', *latin1_strings)

With newer Rubies, you can do things like this:

utf8_string = latin1_string.force_encoding('iso-8859-1').encode('utf-8')

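where latin1_string is tagged as ASCII-8BIT but actually holds ISO-8859-1 bytes. Note that force_encoding changes the encoding tag of latin1_string itself; call dup first if the original must stay untouched.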
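
A minimal sketch of that conversion, assuming the bytes really are ISO-8859-1. The variable names (raw, via_pack) are illustrative and not from the original post, and String#b needs Ruby 2.0 or later. It also shows why the unpack/pack trick from the question's comments happens to work.

```ruby
# Bytes as CSV returned them in the question: tagged ASCII-8BIT (binary),
# but actually ISO-8859-1 text. String#b returns a binary-tagged copy.
raw = "Non sp\xE9cifi\xE9".b

# force_encoding only relabels the bytes (and mutates its receiver, hence dup);
# encode then transcodes them from ISO-8859-1 to UTF-8.
utf8_string = raw.dup.force_encoding("ISO-8859-1").encode("UTF-8")

# The unpack/pack trick gives the same result only because the first 256
# Unicode code points coincide with ISO-8859-1. It would be wrong for
# other single-byte encodings such as Windows-1252.
via_pack = raw.unpack("C*").pack("U*")

puts utf8_string
puts utf8_string.encoding
puts utf8_string == via_pack
```

If the source encoding is uncertain, checking utf8_string.valid_encoding? after the conversion is a cheap sanity test, although it cannot tell one single-byte encoding from another.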

mu is too short
  • Note that Ruby now wants you to use `String#encode` rather than using `iconv`. – duma Mar 20 '13 at 15:14
  • @duma: better now? I left the old Iconv stuff and added a short note about using `force_encoding` and `encode` instead of Iconv. – mu is too short Mar 20 '13 at 21:12
  • `CSV.foreach` worked for me, but I had to use `encoding: "iso-8859-1"` instead of `encoding: "ISO8859-1"` – ltrainpr Apr 09 '15 at 21:22

With Ruby >= 1.9 you can use:

file_contents = CSV.read("csvfile.csv", col_sep: "$", encoding: "ISO8859-1:utf-8")

ISO8859-1:utf-8 means: the CSV file is ISO8859-1 encoded (the external encoding), and its content should be converted to UTF-8 (the internal encoding) as it is read.

If you prefer more verbose code, you can use:

file_contents = CSV.read("csvfile.csv", col_sep: "$",
  external_encoding: "ISO8859-1",
  internal_encoding: "utf-8"
)
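
The "external:internal" form can be tried end to end with a throwaway file. The following is only a sketch: the file prefix, the header row, and the sample row are made up for the demonstration, and it assumes a Ruby recent enough to have Tempfile.create and String#b.

```ruby
require "csv"
require "tempfile"

# Build a small "$"-separated file containing ISO-8859-1 bytes (\xE9 is é),
# read it back through CSV, and let Tempfile.create delete it afterwards.
rows = Tempfile.create(["latin1", ".csv"]) do |f|
  f.binmode
  f.write("id$label\n1$Non sp\xE9cifi\xE9\n".b)
  f.close

  # "ISO-8859-1:UTF-8": decode the file as ISO-8859-1, return UTF-8 strings.
  CSV.read(f.path, col_sep: "$", encoding: "ISO-8859-1:UTF-8")
end

puts rows.last.last
puts rows.last.last.encoding
```

Because the transcoding happens while the file is read, every field arrives as UTF-8 and no per-string conversion is needed before writing to the database.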
knut
  • This is awesome. Before, I had to put in a `bom` for this utf-16 csv: `CSV.read('nom_nom_nom.csv', { :headers => true, :col_sep => "\t", :encoding => 'bom|utf-16le'})`, otherwise it would throw errors. Now it is: `CSV.read('nom_nom_nom.csv', { :headers => true, :col_sep => "\t", external_encoding: 'utf-16', internal_encoding: "utf-8"})`. – Hahn Jun 15 '16 at 04:22

I have been dealing with this issue for a while, and none of the other solutions worked for me.

What did the trick was to write the problematic string to a file in binary mode, read the file back normally, and feed the resulting string to the CSV module:

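require 'csv'
require 'tempfile'

# Write the problematic string to a temporary file in binary mode.
tempfile = Tempfile.new("conflictive_string")
tempfile.binmode
tempfile.write(conflictive_string)
tempfile.close
# Read it back tagged with the default external encoding, then delete the file.
cleaned_string = File.read(tempfile.path)
tempfile.unlink
csv = CSV.new(cleaned_string)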
fguillen