1

I need to parse a CSV file that contains the degree symbol (°) inside a header. If I try to open the file:

CSV.foreach('myfile.csv', headers: true) do |row|
  ...
end

I get "invalid byte sequence in UTF-8 (ArgumentError)". So I tried a few other encodings (ISO-8859-1 and ASCII-8BIT), but I always get a CSV::MalformedCSVError.

Which encoding should I specify in order to be able to read the file?

Actually, I don't care about the degree symbol, so a solution that simply ignores it (and returns, for instance, 'Tx1 C' instead of 'Tx1 °C') would also work for me.

Sig
  • Look into http://stackoverflow.com/questions/9639153/character-encoding-issue-exporting-rails-data-to-csv. That may help you out. – Rajesh Omanakuttan May 09 '14 at 05:30
  • If you do not have problem reading the string in Ruby (outside of CSV routine), then perhaps you can just remove all the `°` symbols prior to reading it with CSV. – sawa May 09 '14 at 05:37

2 Answers

1

The default encoding for reading external files is Encoding.default_external, which is typically UTF-8. However, your CSV file isn't stored in UTF-8. When Ruby tries to interpret a byte sequence that is not valid UTF-8 as UTF-8, it raises an error.
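To see the mismatch concretely, here is a minimal sketch. It assumes the offending byte is 0xB0, which is the degree symbol in ISO-8859-1; the helper names `utf8_valid?` and `latin1_to_utf8` are made up for illustration and are not part of Ruby.

```ruby
# Hypothetical helpers for illustration; not part of Ruby's standard library.

# True if the raw bytes form a valid UTF-8 string.
def utf8_valid?(bytes)
  bytes.dup.force_encoding(Encoding::UTF_8).valid_encoding?
end

# Interpret the raw bytes as ISO-8859-1 and transcode them to UTF-8.
def latin1_to_utf8(bytes)
  bytes.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
end

# Assumed header bytes as they might be stored in the file.
header = "Tx1 \xB0C".b

p utf8_valid?(header)        # false: a lone 0xB0 byte is not valid UTF-8
puts latin1_to_utf8(header)  # Tx1 °C
```

The same bytes are fine once Ruby is told which encoding they are really in, which is why finding the file's actual encoding comes first.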

You should first find out the actual encoding of your CSV file. One way is to open the file in Notepad++ and check which option is selected under the Encoding menu. Other text editors, such as Vim and UltraEdit, offer a similar feature.

Suppose the actual encoding of the CSV file turns out to be GBK. Then rewrite your code as:

CSV.foreach('myfile.csv', headers: true, encoding: 'GBK') do |row|
 ...
end
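For the file in the question, the asker's comments report `charset=iso-8859-1` and a working `row_sep: "\r\r\n"`. A rough sketch that combines both might look like the following. The helper name `read_latin1_csv` is hypothetical, and the `;` column separator is only a guess based on the sample output quoted in the comments, so adjust both defaults to your file.

```ruby
require 'csv'

# Hypothetical helper: reads a CSV file stored as ISO-8859-1 and returns a
# CSV::Table whose strings are UTF-8.
# Assumptions: ';' as column separator and "\r\r\n" as row separator.
def read_latin1_csv(path, col_sep: ';', row_sep: "\r\r\n")
  raw  = File.binread(path)  # raw bytes, no decoding yet
  text = raw.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
  CSV.parse(text, headers: true, col_sep: col_sep, row_sep: row_sep)
end

# Usage (the file name is taken from the question):
#   read_latin1_csv('myfile.csv').each do |row|
#     puts row['Tx1 °C']
#   end
```

Transcoding the whole file up front keeps the header readable as 'Tx1 °C'. It loads the entire file into memory, so for very large files the streaming `CSV.foreach` call with `encoding:` and `row_sep:` options may be a better fit.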
Arie Xiao
  • The encoding is actually `iso-8859-1`: running `file -I QLd01haqJ00Kn.csv` prints `QLd01haqJ00Kn.csv: text/plain; charset=iso-8859-1`. But as I mentioned before, when I specify that encoding I get `CSV::MalformedCSVError`. – Sig May 09 '14 at 06:08
  • @macsig then it is malformed – Arie Xiao May 09 '14 at 09:03
  • The error was solved by specifying `row_sep: "\r\r\n"`. – Sig May 09 '14 at 09:08
0

You could shell out a process to remove the little devils before you open it:

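# '\\260' keeps the backslash so that tr, not Ruby, interprets the octal escape; '>' overwrites any stale temp file
system("LANG=C tr -d '\\260' < myfile.csv > $$.tmp && mv $$.tmp myfile.csv")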

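The tr -d '\260' deletes every byte with octal code 260 (0xB0, the degree symbol in ISO-8859-1). LANG=C makes tr treat the input as raw bytes rather than multibyte characters. The result is saved to a file named after the shell's process ID ($$) with the extension .tmp. If that succeeds (&&), the temporary file replaces the original.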

You can try the tr command on its own at the shell to test it like this:

LANG=C tr -d '\260' < myfile.csv

If you target Windows, the tr command will not be available, and you may have to do something like this to skip the first line (the header row) instead:

more +1 unhappy.csv > happy.csv

Note that more has a limit of 65535 lines though.
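Since the asker develops on a Mac and deploys on Windows, the same byte-stripping idea can also be sketched in plain Ruby without shelling out. This swaps in Ruby's `String#delete` for `tr` and is not what the answer above runs. The helper name `parse_without_degree` is hypothetical, and the sketch assumes an ISO-8859-1 file in which 0xB0 is the only non-ASCII byte.

```ruby
require 'csv'

# Hypothetical helper: drops every 0xB0 byte (the degree symbol in
# ISO-8859-1) before parsing, roughly what `LANG=C tr -d '\260'` does.
# Assumption: no other non-ASCII bytes remain. If some do, the cleaned
# string is not valid UTF-8 and CSV will raise.
def parse_without_degree(path, **csv_options)
  raw     = File.binread(path)  # binary (ASCII-8BIT) string
  cleaned = raw.delete("\xB0".b).force_encoding(Encoding::UTF_8)
  CSV.parse(cleaned, headers: true, **csv_options)
end

# Usage (file name from the question; col_sep is a guess):
#   table = parse_without_degree('myfile.csv', col_sep: ';')
#   table.headers  # the header comes back as 'Tx1 C' rather than 'Tx1 °C'
```

Unlike the shell version, this leaves the original file untouched and behaves the same on both systems. Do not use it on a UTF-8 file: there the degree symbol is the two bytes 0xC2 0xB0, and removing only 0xB0 would leave a stray byte behind.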

Mark Setchell
  • I'd rather not use this approach, since I'm developing on a Mac and running production on Windows, and I'm not sure it behaves consistently on both systems. Even so, when I run `tr -d '\260' < myfile.csv` I get `Illegal byte sequence` right after `Date;Hour;pv11- Temperature(`; the degree symbol comes after the `(`. – Sig May 09 '14 at 06:27
  • Of course. I understand. For grins, you could try prefixing the command with "LANG=C", so like this `LANG=C tr -d ...` – Mark Setchell May 09 '14 at 06:33
  • This seems to work. Thanks for the help. If I don't find anything else I will try your solution on both systems and see if I can use it. Have a nice day. – Sig May 09 '14 at 06:36