Despite of numerous SO threads on the topic, I'm having trouble with parsing CSV. It's a .csv file downloaded from the Adwords Keyword Planner. Previously, Adwords had the option of exporting data as 'plain CSV' (which could be parsed with the Ruby CSV library), now the options are either Adwords CSV or Excel CSV. BOTH of these formats cause this problem (illustrated by a terminal session):
file = File.open('public/uploads/testfile.csv')
=> #<File:public/uploads/testfile.csv>
file.read.encoding
=> #<Encoding:UTF-8>
require 'csv'
=> true
CSV.foreach(file) { |row| puts row }
ArgumentError: invalid byte sequence in UTF-8
Let's change the encoding and see if that helps:
file.close
=> nil
file = File.open("public/uploads/testfile.csv", "r:ISO-8859-1")
=> #<File:public/uploads/testfile.csv>
file.read.encoding
=> #<Encoding:ISO-8859-1>
CSV.foreach(file) { |row| puts row }
ArgumentError: invalid byte sequence in UTF-8
Let's try using a different CSV library:
require 'smarter_csv'
=> true
file.close
=> nil
file = SmarterCSV.process('public/uploads/testfile.csv')
ArgumentError: invalid byte sequence in UTF-8
Is this a no-win situation? Do I have to roll my own CSV parser?
I'm using Ruby 1.9.3p374. Thanks!
UPDATE 1:
Using the suggestions in the comments, here's the current version:
file_contents = File.open("public/uploads/new-format/testfile-adwords.csv", 'rb').read
require 'iconv' unless String.method_defined?(:encode)
if String.method_defined?(:encode)
file_contents.encode!('UTF-16', 'UTF-8', :invalid => :replace, :replace => '')
file_contents.encode!('UTF-8', 'UTF-16')
else
ic = Iconv.new('UTF-8', 'UTF-8//IGNORE')
file_contents = ic.iconv(file_contents)
end
file_contents.gsub!(/\0/, '') #needed because otherwise, I get "string contains null byte (ArgumentError)"
CSV.foreach(file_contents, :headers => true, :header_converters => :symbol) do |row|
puts row
end
This doesn't work - now I get a "file name too long" error.