I did my research, but none of the solutions I found seems to work for my case. I have a CSV file in Gmail format, exported from Gmail itself. My parsing code is as simple as this:

CSV.open(file.path) do |csv|

Error is:

Unquoted fields do not allow \r or \n

I tried combinations of the row_sep and encoding options, but none of them helped. Any thoughts?

Reading the file in Ruby returns:

ruby -e 'p File.read("tmp/google.csv")'       

"\xFF\xFEN\u0000a\u0000m\u0000e\u0000,\u0000G\u0000i\u0000v\u0000e\u0000n\u0000 \u0000N\u0000a\u0000m\u0000e\u0000,\u0000A\u0000d\u0000d\u0000i\u0000t\u0000i\u0000o\u0000n\u0000a\u0000l\u0000 \u0000N\u0000a\u0000m\u0000e\u0000,\u0000F\u0000a\u0000m\u0000i\u0000l\u0000y\u0000 \u0000N\u0000a\u0000m\u0000e\u0000,\u0000Y\u0000o\u0000m\u0000i\u0000 \u0000N\u0000a\u0000m\u0000e\u0000,\u0000G\u0000i\u0000v\u0000e\u0000n\u0000 \u0000N\u0000a\u0000m\u0000e\u0000 \u0000Y\u0000o\u0000m\u0000i\u0000,\u0000A\u0000d\u0000d\u0000i\u0000t\u0000i\u0000o\u0000n\u0000a\u0000l\u0000 \u0000N\u0000a\u0000m\u0000e\u0000 \u0000Y\u0000o\u0000m\u0000i\u0000,\u0000F\u0000a\u0000m\u0000i\u0000l\u0000y\u0000 \u0000N\u0000a\u0000m\u0000e\u0000 \u0000Y\u0000o\u0000m\u0000i\u0000,\u0000N\u0000a\u0000m\u0000e\u0000 \u0000P\u0000r\u0000e\u0000f\u0000i\u0000x\u0000,\u0000N\u0000a\u0000m\u0000e\u0000 \u0000S\u0000u\u0000f\u0000f\u0000i\u0000x\u0000,\u0000I\u0000n\u0000i\u0000t\u0000i\u0000a\u0000l\u0000s\u0000,\u0000N\u0000i\u0000c\u0000k\u0000n\u0000a\u0000m\u0000e\u0000,\u0000S\u0000h\u0000o\u0000r\u0000t\u0000 \u0000N\u0000a\u0000m\u0000e\u0000,\u0000M\u0000a\u0000i\u0000d\u0000e\u0000n\u0000 \u0000N\u0000a\u0000m\u0000e\u0000,\u0000B\u0000i\u0000r\u0000t\u0000h\u0000d\u0000a\u0000y\u0000,\u0000G\u0000e\u0000n\u0000d\u0000e\u0000r\u0000,\u0000L\u0000o\u0000c\u0000a\u0000t\u0000i\u0000o\u0000n\u0000,\u0000B\u0000i\u0000l\u0000l\u0000i\u0000n\u0000g\u0000 \u0000I\u0000n\u0000f\u0000o\u0000r\u0000m\u0000a\u0000t\u0000i\u0000o\u0000n\u0000,\u0000D\u0000i\u0000r\u0000e\u0000c\u0000t\u0000o\u0000r\u0000y\u0000 
\u0000S\u0000e\u0000r\u0000v\u0000e\u0000r\u0000,\u0000M\u0000i\u0000l\u0000e\u0000a\u0000g\u0000e\u0000,\u0000O\u0000c\u0000c\u0000u\u0000p\u0000a\u0000t\u0000i\u0000o\u0000n\u0000,\u0000H\u0000o\u0000b\u0000b\u0000y\u0000,\u0000S\u0000e\u0000n\u0000s\u0000i\u0000t\u0000i\u0000v\u0000i\u0000t\u0000y\u0000,\u0000P\u0000r\u0000i\u0000o\u0000r\u0000i\u0000t\u0000y\u0000,\u0000S\u0000u\u0000b\u0000j\u0000e\u0000c\u0000t\u0000,\u0000N\u0000o\u0000t\u0000e\u0000s\u0000,\u0000G\u0000r\u0000o\u0000u\u0000p\u0000 \u0000M\u0000e\u0000m\u0000b\u0000e\u0000r\u0000s\u0000h\u0000i\u0000p\u0000,\u0000E\u0000-\u0000m\u0000a\u0000i\u0000l\u0000 \u00001\u0000 \u0000-\u0000 \u0000T\u0000y\u0000p\u0000e\u0000,\u0000E\u0000-\u0000m\u0000a\u0000i\u0000l\u0000 \u00001\u0000 \u0000-\u0000 \u0000V\u0000a\u0000l\u0000u\u0000e\u0000,\u0000E\u0000-\u0000m\u0000a\u0000i\u0000l\u0000 \u00002\u0000 \u0000-\u0000 \u0000T\u0000y\u0000p\u0000e\u0000,\u0000E\u0000-\u0000m\u0000a\u0000i\u0000l\u0000 \u00002\u0000 \u0000-\u0000 \u0000V\u0000a\u0000l\u0000u\u0000e\u0000,\u0000P\u0000h\u0000o\u0000n\u0000e\u0000 \u00001\u0000 \u0000-\u0000 \u0000T\u0000y\u0000p\u0000e\u0000,\u0000P\u0000h\u0000o\u0000n\u0000e\u0000 \u00001\u0000 \u0000-\u0000 \u0000V\u0000a\u0000l\u0000u\u0000e\u0000,\u0000W\u0000e\u0000b\u0000s\u0000i\u0000t\u0000e\u0000 \u00001\u0000 \u0000-\u0000 \u0000T\u0000y\u0000p\u0000e\u0000,\u0000W\u0000e\u0000b\u0000s\u0000i\u0000t\u0000e\u0000 \u00001\u0000 \u0000-\u0000 \u0000V\u0000a\u0000l\u0000u\u0000e\u0000\r\u0000\n\u0000\u0010\u0004;\u00045\u0004:\u0004A\u00040\u0004=\u00044\u0004@\u0004,\u0000\u0010\u0004;\u00045\u0004:\u0004A\u00040\u0004=\u00044\u0004@\u0004,\u0000,\u0000,\u0000,\u0000,\u0000,\u0000,\u0000,\u0000,\u0000,\u0000,\u0000,\u0000,\u0000,\u0000,\u0000,\u0000,\u0000,\u0000,\u0000,\u0000,\u0000,\u0000,\u0000,\u0000,\u0000*\u0000 \u0000M\u0000y\u0000 \u0000C\u0000o\u0000n\u0000t\u0000a\u0000c\u0000t\u0000s\u0000 \u0000:\u0000:\u0000:\u0000 \u0000*\u0000 
\u0000F\u0000r\u0000i\u0000e\u0000n\u0000d\u0000s\u0000,\u0000*\u0000 \u0000O\u0000t\u0000h\u0000e\u0000r\u0000,\u0000s\u0000t\u0000a\u0000r\u0000s\u0000h\u0000y\u0000n\u0000i\u0000n\u0000@\u0000g\u0000m\u0000a\u0000i\u0000l\u0000.\u0000c\u0000o\u0000m\u0000,\u0000,\u0000,\u0000,\u0000,\u0000,\u0000\r\u0000\n\u0000"

It seems that Google's export files use an unusual encoding:

enca tmp/google.csv
Universal character set 2 bytes; UCS-2; BMP
  CRLF line terminators
  Byte order reversed in pairs (1,2 -> 2,1)

File content:

Name,Given Name,Additional Name,Family Name,Yomi Name,Given Name Yomi,Additional Name Yomi,Family Name Yomi,Name Prefix,Name Suffix,Initials,Nickname,Short Name,Maiden Name,Birthday,Gender,Location,Billing Information,Directory Server,Mileage,Occupation,Hobby,Sensitivity,Priority,Subject,Notes,Group Membership,E-mail 1 - Type,E-mail 1 - Value,E-mail 2 - Type,E-mail 2 - Value,Phone 1 - Type,Phone 1 - Value,Website 1 - Type,Website 1 - Value
Александр,Александр,,,,,,,,,,,,,,,,,,,,,,,,,* My Contacts ::: * Friends,* Other,starshynin@gmail.com,,,,,,
Mikhail Nikalyukin

1 Answer

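You may need to specify the encoding when opening the file. Try something like this, varying the encoding until the file decodes correctly:

File.open(file.path, "rb:UTF-16LE").read.encode("utf-8")

The encoding of your file seems to be UTF-16, so try UTF-16, UTF-16LE and UTF-16BE. The dump in the question starts with the bytes \xFF\xFE, which is the little-endian byte-order mark (BOM), so UTF-16LE is the most likely match.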
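To make the "try each encoding" step concrete, here is a minimal sketch that decodes the first few bytes from the dump in the question under both byte orders. The helper name decode_as is hypothetical, and the replacement options are only there so that a wrong guess prints something instead of raising.

```ruby
# Hypothetical helper: interpret raw bytes as `encoding` and transcode to UTF-8.
# Undecodable sequences are replaced rather than raising an error.
def decode_as(bytes, encoding)
  bytes.dup.force_encoding(encoding)
       .encode("UTF-8", invalid: :replace, undef: :replace)
end

# First bytes of the dump in the question: the BOM FF FE, then "Name".
sample = "\xFF\xFEN\x00a\x00m\x00e\x00".b

%w[UTF-16LE UTF-16BE].each do |encoding|
  puts "#{encoding}: #{decode_as(sample, encoding).inspect}"
end
```

Only the UTF-16LE attempt yields the readable text "Name" (preceded by the BOM character U+FEFF); the big-endian attempt pairs the bytes the wrong way round and produces unrelated characters.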

After that you can feed the decoded data into a CSV reader. Note that CSV.open expects a path rather than an already opened File, so pass the encoding in the mode string:

CSV.open(file.path, "rb:UTF-16LE:UTF-8") do |csv|

and process the file. The trailing :UTF-8 asks Ruby to transcode the data to UTF-8 as it is read; whether you need that depends on your use case.
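As a rough sketch of the whole path from raw bytes to parsed rows, assuming the input really is UTF-16LE with a leading BOM as in the question: the helper name parse_utf16le_csv is made up for illustration, and headers: true is a choice taken from the asker's later comment, not a requirement.

```ruby
require "csv"

# Hypothetical helper: decode raw UTF-16LE bytes to UTF-8, drop the BOM,
# and parse the result as CSV with a header row.
def parse_utf16le_csv(raw_bytes)
  text = raw_bytes.dup.force_encoding("UTF-16LE").encode("UTF-8")
  # The BOM survives transcoding as the character U+FEFF; remove it so it
  # does not end up glued to the first header name.
  text = text.sub(/\A\uFEFF/, "")
  CSV.parse(text, headers: true)
end
```

It could be called as parse_utf16le_csv(File.binread(file.path)). Reading in binary mode first keeps Ruby from guessing an encoding before the bytes are decoded explicitly.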

valo
  • I'm able to read the CSV with the encoding, but unfortunately the file is always malformed after encoding `#`. – Mikhail Nikalyukin Mar 14 '14 at 11:28
  • @MikhailNikalyukin Which encoding did you use? – Arup Rakshit Mar 14 '14 at 11:29
  • Okay, got it working like this: `File.open(file.tempfile, "rb:UTF-16") do |f|` `csv = CSV.parse(f.read, :headers => true)` But ordinary files that are not UTF-16 (UCS-2) do not work with this approach. How can I make this optional, given that Ruby returns `` for both UTF-16 and non-UTF-16 files? – Mikhail Nikalyukin Mar 14 '14 at 11:42
  • https://github.com/brianmario/charlock_holmes this gem returns the correct encoding, `UTF-16LE`, while Ruby says UTF-8 – Mikhail Nikalyukin Mar 14 '14 at 11:53
  • If you are not sure what encoding your input is in, you need to use a gem like charlock_holmes to detect it. There is no universal way to do this, and it can be a tough task. You can also check out https://rubygems.org/gems/rchardet19 – valo Mar 14 '14 at 13:14
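For the narrower case raised in the comments, where uploads are either UTF-16 with a BOM (like the Gmail export) or ordinary files, a stdlib-only sketch is to look at the BOM and fall back to a default. This is not general encoding detection, which is what the gems above attempt. The helper name parse_csv_bytes is hypothetical, and treating BOM-less input as UTF-8 is an assumption that may not hold for your files.

```ruby
require "csv"

# Hypothetical helper: parse CSV bytes whose encoding is signalled by a BOM.
# Assumption: input without a recognised BOM is UTF-8.
def parse_csv_bytes(bytes)
  boms = {
    "\xEF\xBB\xBF".b => "UTF-8",
    "\xFF\xFE".b     => "UTF-16LE",
    "\xFE\xFF".b     => "UTF-16BE",
  }
  bytes = bytes.b # binary copy, so prefixes are compared byte by byte
  bom, encoding = boms.find { |prefix, _| bytes.start_with?(prefix) }
  body = bom ? bytes.byteslice(bom.bytesize..-1) : bytes
  text = body.force_encoding(encoding || "UTF-8").encode("UTF-8")
  CSV.parse(text, headers: true)
end
```

Usage would be parse_csv_bytes(File.binread(file.tempfile.path)), assuming file.tempfile is a Tempfile as in the comment above. Files in legacy single-byte encodings would still need a detector such as charlock_holmes.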