73

we allow users to import data via csv (using ruby 1.9.2, hence it's fastercsv).

being user data, of course, it might not be properly sanitized.

When we try to display the data in an /index method we sometimes get the error "invalid byte sequence in UTF-8" pointing to our erb where we display one of the fields widget.name

When we do the import we'd like to FORCE the incoming data to be valid... is there a ruby operator that will map a string to a valid utf8 string, eg, something like

goodstring = badstring.no_more_invalid_bytes

One example of 'bad' data is char that looks like a hyphen but is not a regular ascii hyphen. We'd prefer to map the non-utf-8 chars to a reasonable ascii equivalent (umlat-u going to u for exmaple) BUT we're okay with simply stripping the character to.

since this is when importing lots of data, it needs to be a fast built-in operator, hopefully...


Note: here is an example of the data. The file comes form windows and is 8bit ascii. when we import it and in our erb we display widget.name.inspect (instead of widget.name) we get: "Chains \x96 Accessories"

so one example of the data is a "hyphen" that's actually 8 bit code 96.

--- when we changed our csv parse to assign fldval = d.encode('UTF-8') it throws this error:

Encoding::UndefinedConversionError in StoresController#importfinderitems
"\x96" from ASCII-8BIT to UTF-8

what we're looking for is a simple way to just force it to be valid utf8 regardless of origin type, even if we simply strip non-ascii.


while not as 'nice' as forcing the encoding, this works at a slight expense to our import time: d.to_s.strip.gsub(/\P{ASCII}/, '') Thank you, Mladen!

jpw
  • 18,697
  • 25
  • 111
  • 187
  • Are you using String#force_encoding? Can you also paste an example string that generates an error? – bensie Feb 19 '11 at 20:24
  • A _very_ rough (but easy) solution would be to strip _all_ non-ASCII characters from the input. Would that work for you? – Mladen Jablanović Feb 19 '11 at 20:38
  • 1
    BTW, `\x96` is a hyphen _in Windows-1251 encoding_. If you are sure that all the incoming data is in that particular one, you could convert it to UTF-8. – Mladen Jablanović Feb 19 '11 at 20:47
  • stripping all non-ascii is fine, how do we do that? And no, we ar enot sure about the input format... many users, many things that can be different, hence our need to "just force it" – jpw Feb 19 '11 at 20:48
  • 1
    used a regex to strip, thanks for the suggestion. would prefer to force encoding but this will work! – jpw Feb 19 '11 at 20:55

9 Answers9

141

Ruby 1.9 CSV has new parser that works with m17n. The parser works with Encoding of IO object in the string. Following methods: ::foreach, ::open, ::read, and ::readlines could take in optional options :encoding which you could specify the the Encoding.

For example:

CSV.read('/path/to/file', :encoding => 'windows-1251:utf-8')

Would convert all strings to UTF-8.

Also you can use the more standard encoding name 'ISO-8859-1'

CSV.read('/..', {:headers => true, :col_sep => ';', :encoding => 'ISO-8859-1'})
siegy22
  • 4,295
  • 3
  • 25
  • 43
Trung Lê
  • 5,188
  • 5
  • 27
  • 26
  • 3
    What does 'windows-1251:utf-8' mean? – makstaks Feb 05 '14 at 18:40
  • 1
    @hmak It means "convert from windows-1251 character encoding to utf-8 encoding" – Minh Triet Jun 11 '14 at 03:45
  • Thanks a ton man ":encoding => 'windows-1251:utf-8' " did the trick – Animesh Feb 26 '15 at 14:38
  • CSV.read('/..', {:headers => true, :col_sep => ';', :encoding => 'ISO-8859-1'}) words for me – Bret Weinraub Jul 22 '15 at 17:16
  • 3
    Is there a way to force any encoding, not just windows-1251, to UTF-8? Something like '????:utf-8' ? I.e. I just want to make sure my imported CSV is always UTF8, don't really care what encoding it was originally. – Peter-Jan Celis Jul 22 '15 at 19:06
  • Thanks, `CSV.read('/path/to/file', :encoding => 'windows-1252:utf-8')` was more suitable solution for me. (Note: 1252 instead of 1521). – atw Aug 26 '15 at 11:05
  • I would also like to know how to force an encoding when you don't know the source. Alternatively, I wish there was a way to handle errors on a specific line. If 99 out of 100 lines are fine and one is an issue it would be nice to be able to throw the one bad line out and keep processing. – Altonymous Sep 12 '15 at 03:19
  • @Altonymous: See this question for details on how to do this: http://stackoverflow.com/questions/8125719/rails-importing-csv-fails-due-to-mal-formation/43626465 – Richard Jones Apr 27 '17 at 17:37
  • 1
    You shouldn't say "also you can use" and then show a line that is wildly different than the one you are referring to. Is `:encoding => 'ISO-8859-1'` swappable with `:encoding => 'windows-1251:utf-8'` or no? – Joshua Pinter Jan 13 '19 at 16:10
  • Worked great on a 2GB+ dirty CSV that I couldn't process! – Avishai Mar 06 '20 at 01:21
25
CSV.parse(File.read('/path/to/csv').scrub)
Bill Lipa
  • 2,039
  • 1
  • 19
  • 11
14

I answered a similar question that deals with reading external files in 1.9.2 with non-UTF-8 encodings. I think that answer will help you a lot: Character Encoding issue in Rails v3/Ruby 1.9.2

Note that you need to know the source encoding for you to convert it anything reliably. There are libraries like the one I linked to in my other answer that can help you determine this.

Also, if you aren't loading the data from a file, you can convert the encoding of a string in 1.9.2 quite easily:

'string'.encode('UTF-8')

However, it's rare that you're building a string in another encoding, and it's best to convert it at the time it's read into your environment if possible.

Community
  • 1
  • 1
coreyward
  • 77,547
  • 20
  • 137
  • 166
  • That explanation. while extremely helpful for my understanding of what's going wrong, doesn't seem to address how I cna fix this issue. we have no control over what the users try to import, we just need to sanitize it. And to make matters worse, using fastercsv we don't do an explicit "open" we just do a read "csv_data = CSV.read(params[:upload][:file].tempfile)" – jpw Feb 19 '11 at 20:29
  • 1
    fwiw, we can read the file fine, csv-parse it fine, store it in the dbase fine (well, at least to local dbase have not tried the same test file on heroku yet). yet the /index gripes about the non-utf sequenc. – jpw Feb 19 '11 at 20:30
  • 88
    Ruby 1.9 has CSV class built-in, and its `read` method receives optional `:encoding` parameter, so you could have your file transcoded automatically upon loading using something like `CSV.read('/path/to/file', :encoding => 'windows-1251:utf-8')`. – Mladen Jablanović Feb 19 '11 at 20:56
  • @MladenJablanović, that saves the problem I was running into! Thanks! – sivabudh Nov 28 '11 at 20:02
  • @Mladen Jablanović thanks a lot!!!! It was so killing me, couldn't quite figure out that the file was coming from crappy windows machine :) – Alex Bush Jun 07 '13 at 08:39
  • @MladenJablanović, thanks! Please post this as a separate answer rather than a comment! – jpgeek May 01 '16 at 23:54
  • @jpgeek someone already did that, take a look at answer right below ;) – Mladen Jablanović May 02 '16 at 12:25
  • @MladenJablanović so they did. I spoke too soon :-/. You were robbed! – jpgeek May 03 '16 at 03:02
9

Ruby 1.9 can change string encoding with invalid detection and replacement:

str = str.encode('UTF-8', :invalid => :replace)

For unusual strings such as strings loaded from a file of unknown encoding, it's wise to use #encode instead of a regex, #gsub, or #delete, because these all need the string to be parsed-- but if the string is broken, it can't be parsed, so those methods fail.

If you get a message like this:

error ** from ASCII-8BIT to UTF-8

Then you're probably trying to convert a binary string that's already in UTF-8, and you can force UTF-8:

str.force_encoding('UTF-8')

If you know the original string is not in binary UTF-8, or if the output string has illiegal characters, then read up on Ruby encoding transliterations.

joelparkerhenderson
  • 34,808
  • 19
  • 98
  • 119
7

If you are using Rails, you can try to fix it with the following

'Your string with strange stuff #@~'.mb_chars.tidy_bytes

It removes you the invalid utf-8 chars and replaces it with valid ones. More info: https://apidock.com/rails/String/mb_chars

dom
  • 509
  • 1
  • 7
  • 13
3

As mentioned by someone else, scrub works well to clean this up in Ruby 2.1+. If you have a large file you may not want to read the whole thing into memory, so you can use scrub like this:

data = IO::read(file_path).scrub("")
CSV.parse(data, :col_sep => ',', :headers => true)  do |row|
   puts row
end
Andy Fraley
  • 1,043
  • 9
  • 16
2

Upload the CSV file to Google Docs Spreadsheet and re-download it as a CSV file. Import and voila! (Worked in my case)

Presumably Google converts it to the wanted format..

Source: Excel to CSV with UTF-8 Encoding

Community
  • 1
  • 1
Jonathan Lin
  • 19,922
  • 7
  • 69
  • 65
2

I am using MAC and I was having the same error:

rescue in parse:Invalid byte sequence in UTF-8 in line 1 (CSV::MalformedCSVError)

I added :encoding => 'ISO-8859-1' that resolved my error and csv file could be read.


results = CSV.read("query_result.csv",{:headers => true, :encoding => 'ISO-8859-1'})

:headers => true : If set to :first_row or true, the initial row of the CSV file will be treated as a row of headers. If set to an Array, the contents will be used as the headers. If set to a String, the String is run through a call of ::parse_line with the same :col_sep, :row_sep, and :quote_char as this instance to produce an Array of headers. This setting causes #shift to return rows as CSV::Row objects instead of Arrays and #read to return CSV::Table objects instead of an Array of Arrays.

irb(main):024:0> rows = CSV.new(StringIO.new("a,b,c\n1,2,3"), headers: true)
=> <#CSV io_type:StringIO encoding:UTF-8 lineno:0 col_sep:"," row_sep:"\n" quote_char:"\"" headers:true>
irb(main):025:0> rows = CSV.new(StringIO.new("a,b,c\n1,2,3"), headers: true).to_a
=> [#<CSV::Row "a":"1" "b":"2" "c":"3">]
irb(main):026:0> rows.first['a']
=> "1"

In above example you can clearly see that this also enables us to use data as hashes. The only thing you would need to be careful about while using headers: true that it won't allow any duplicate headers as keys are unique in hashes.

Sumeet Raina
  • 117
  • 6
1

Only do this

anyobject.to_csv(:encoding => 'utf-8')