SmarterCSV and file encoding issues in Ruby

Question

I'm working with a file that appears to have UTF-16LE encoding. If I run

File.read(file, :encoding => 'utf-16le')

the first line of the file is:

"<U+FEFF>=\"25/09/2013\"\t18:39:17\t=\"Unknown\"\t=\"+15168608203\"\t\"Message.\"\r\n

If I read the file using something like

csv_text = File.read(file, :encoding => 'utf-16le')

I get an error stating

ASCII incompatible encoding needs binmode (ArgumentError)

If I switch the encoding in the above to

csv_text = File.read(file, :encoding => 'utf-8')

I make it to the SmarterCSV section of the code, but get an error that states

`=~': invalid byte sequence in UTF-8 (ArgumentError)

The full code is below. If I run this in the Rails console, it works just fine, but if I run it using ruby test.rb, it gives me the first error:

require 'smarter_csv'
headers = ["date_of_message", "timestamp_of_message", "sender", "phone_number", "message"]
path = '/path/'
Dir.glob("#{path}*.CSV").each do |file|
  csv_text = File.read(file, :encoding => 'utf-16le')
  File.open('/tmp/tmp_file', 'w') { |tmp_file| tmp_file.write(csv_text) }
  puts 'made it here'
  SmarterCSV.process('/tmp/tmp_file', {
    :col_sep => "\t",
    :force_simple_split => true,
    :headers_in_file => false,
    :user_provided_headers => headers
  }).each do |row|
    converted_row = {}
    converted_row[:date_of_message] = row[:date_of_message][2..-2].to_date
    converted_row[:timestamp] = row[:timestamp_of_message]
    converted_row[:sender] = row[:sender][2..-2]
    converted_row[:phone_number] = row[:phone_number][2..-2]
    converted_row[:message] = row[:message][1..-2]
    converted_row[:room] = file.gsub(path, '')
  end
end

Update - 05/13/15

Ultimately, I decided to encode the file string as UTF-8 rather than dive deeper into the SmarterCSV code. The first problem in the SmarterCSV code is that it does not allow a user to specify binary mode when reading in a file. After I adjusted the source to handle that, a myriad of other encoding-related issues popped up, many of them related to how various parameters are handled for files that are not UTF-8 encoded. It may have been the easy way out, but encoding everything as UTF-8 before feeding it into SmarterCSV solved my issue.
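
The workaround described in this update can be sketched roughly as follows. This is a minimal sketch, not the exact code that was used: the helper name `utf16le_to_utf8` is hypothetical, and it assumes the input really is UTF-16LE with an optional leading byte order mark.

```ruby
# Hypothetical helper (not part of SmarterCSV): return the contents of a
# UTF-16LE file as a UTF-8 string.
def utf16le_to_utf8(path)
  raw  = File.binread(path)                      # raw bytes, tagged ASCII-8BIT
  text = raw.force_encoding(Encoding::UTF_16LE)  # relabel the bytes; nothing is converted yet
  text = text.encode(Encoding::UTF_8)            # transcode the characters to UTF-8
  text.sub(/\A\uFEFF/, '')                       # drop the byte order mark, if present
end
```

The returned string could then be written to the temp file with something like `File.write('/tmp/tmp_file', utf16le_to_utf8(file))` before calling `SmarterCSV.process` as in the code above.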

What happens if you use the built-in [CSV](http://ruby-doc.org/stdlib-2.2.2/libdoc/csv/rdoc/index.html) class? — the Tin Man, May 06 '15 at 19:26

Rots · Accepted Answer · 2015-05-13T21:44:20.600

2

Add binmode to the File.read call.

File.read(file, :encoding => 'utf-16le', mode: "rb")

"b": Binary file mode. Suppresses EOL <-> CRLF conversion on Windows, and sets external encoding to ASCII-8BIT unless explicitly specified.

ref: http://ruby-doc.org/core-2.0.0/IO.html#method-c-read
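
As a small illustration of the call above, here is a sketch that assumes a UTF-16LE input file. The method name `read_utf16le` and the sample file are made up for the example.

```ruby
require 'tmpdir'

# Hypothetical wrapper around the File.read call shown above.
def read_utf16le(path)
  # "rb" opens the file in binary mode; the explicit :encoding keeps the
  # result tagged as UTF-16LE instead of the ASCII-8BIT default for "b".
  File.read(path, mode: 'rb', encoding: 'utf-16le')
end

Dir.mktmpdir do |dir|
  # Made-up sample: BOM plus one tab-separated line, stored as UTF-16LE.
  sample = File.join(dir, 'sample.CSV')
  File.binwrite(sample, "\uFEFFa\tb\r\n".encode('UTF-16LE'))

  text = read_utf16le(sample)
  puts text.encoding         # the encoding tag Ruby attached to the string
  puts text.valid_encoding?  # whether the bytes are well-formed for that tag
end
```

Note that the string is still UTF-16LE at this point and still starts with the byte order mark, so anything that expects UTF-8 needs a separate conversion step.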

Now pass the correct encoding to SmarterCSV:

SmarterCSV.process('/tmp/tmp_file', {
:file_encoding => "utf-16le", ...

Update

It turned out that SmarterCSV does not support binary mode. After the OP tried to modify the gem's code without success, it was decided that the simplest solution was to convert the input to UTF-8, which SmarterCSV supports.

edited May 13 '15 at 21:44

answered May 06 '15 at 21:26


This addresses the first issue I was having, but still causes the SmarterCSV issue: `=~': invalid byte sequence in UTF-8 (ArgumentError)` – AvocadoRivalry May 06 '15 at 22:24
Added to the answer to address your second problem. – Rots May 06 '15 at 22:29
It doesn't seem like there's a way to specify binmode for SmarterCSV - the error has now changed to what I was originally getting in my `File.read` step: `ASCII incompatible encoding needs binmode (ArgumentError)`. – AvocadoRivalry May 06 '15 at 22:33
The example in SmarterCSV has this: `f = File.open(filename, "r:bom|utf-8")`. Perhaps change your write statement to `File.open('/tmp/tmp_file', 'w:bom|utf-16le')` – Rots May 06 '15 at 22:38
Nope, no luck. Any reason why this code would work in a Rails console, but not just simply running the code as a Ruby program (subbing out the SmarterCSV gem)? – AvocadoRivalry May 06 '15 at 22:54
On the Ruby console you might need to specify the encoding, e.g. `ruby -E UTF-16LE program.rb` – Rots May 06 '15 at 22:56
Still no luck. I'm going to just call it a day on this one. Encoding can be unnecessarily frustrating. – AvocadoRivalry May 06 '15 at 23:05
Sorry to hear that. If you're willing to share the file, I'd be interested to have a go at trying it out (Would also like to know the answer!) Perhaps I can just use that first line you posted? – Rots May 06 '15 at 23:06
I can share a sample file definitely - appreciate the offer. How do I go about doing that through Stack Overflow? Also, is there a surefire way to determine the encoding of a file? Perhaps it's as simple as me not having the correct encoding. – AvocadoRivalry May 06 '15 at 23:09
http://www.filedropper.com/ for sharing the file is a good way. Unfortunately detecting code pages can be difficult, you really need to know what the file was - check this out http://stackoverflow.com/questions/90838/how-can-i-detect-the-encoding-codepage-of-a-text-file – Rots May 06 '15 at 23:13
Thanks for the link. Here's a sample file (same format as the ones I'm trying to read in and process): http://www.filedropper.com/psy-gentlemen-messages – AvocadoRivalry May 06 '15 at 23:16
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/77132/discussion-between-rots-and-avocadorivalry). – Rots May 06 '15 at 23:16
Thanks @Rots for all the help - I updated my question with the solution I ultimately used and some commentary! – AvocadoRivalry May 13 '15 at 20:37
@AvocadoRivalry Nice job! You can add that as an answer if you like, rather than an update. If you'd like me to do that let me know, as I know we discussed this in our chat :) – Rots May 13 '15 at 21:02
No problem @AvocadoRivalry it was great working on this with you. I have updated this answer with your solution. Cheers!! – Rots May 13 '15 at 21:44

score 0 · Answer 2 · answered May 06 '15 at 20:36

Unfortunately, you're using a 'flat-file' style of storage and character encoding is going to be an issue on both ends (reading or writing).

I would suggest trying something along the lines of str = str.force_encoding("UTF-8") to see if you can get that to work.
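
One caveat is worth illustrating here: `force_encoding` only changes the label on a string and does not convert its bytes, so on UTF-16LE data it tends to produce an invalid UTF-8 string rather than a usable one. A minimal sketch with made-up sample data, comparing relabelling against transcoding with `encode`:

```ruby
# Made-up sample: the raw bytes a UTF-16LE file would contain (BOM + "a\tb").
utf16_bytes = "\uFEFFa\tb".encode('UTF-16LE').b

# Relabel only: same bytes, now claimed to be UTF-8.
relabelled = utf16_bytes.dup.force_encoding('UTF-8')

# Label the bytes correctly first, then convert them to UTF-8.
transcoded = utf16_bytes.dup.force_encoding('UTF-16LE').encode('UTF-8')

puts relabelled.valid_encoding?  # are the relabelled bytes well-formed UTF-8?
puts transcoded.valid_encoding?  # is the transcoded string well-formed UTF-8?
```

A string that is not valid in its tagged encoding is the kind of input that can trigger errors such as the `invalid byte sequence in UTF-8` message quoted in the question.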
