File.readlines invalid byte sequence in UTF-8 (ArgumentError)

Question

I am processing a file that contains data from the web, and I get an "invalid byte sequence in UTF-8 (ArgumentError)" error on certain log files.

require 'csv'

a = File.readlines('log.csv').grep(/watch\?v=/).map do |s|
  s = s.parse_csv
  { timestamp: s[0], url: s[1], ip: s[3] }
end
puts a

I am trying to get this solution working. I have seen people doing

.encode!('UTF-8', 'UTF-8', :invalid => :replace)

but it doesn't appear to work with File.readlines.

File.readlines('log.csv').encode!('UTF-8', 'UTF-8', :invalid => :replace).grep(/watch\?v=/)

undefined method `encode!' for #<Array:0x...> (NoMethodError)

What's the most straightforward way to filter or convert invalid UTF-8 characters while reading a file?

~~Attempt 1~~

I tried this, but it failed with the same invalid byte sequence error.

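# The second positional argument of IO.foreach is the line separator,
# so the open mode has to be passed as the mode: keyword.
IO.foreach('test.csv', mode: 'r:bom|UTF-8').grep(/watch\?v=/).map do |s|
  # extract three columns: time stamp, url, ip
  s = s.parse_csv
  { timestamp: s[0], url: s[1], ip: s[3] }
end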

Solution

This seems to have worked for me.

require 'csv'

a = File.readlines('log.csv', :encoding => 'ISO-8859-1').grep(/watch\?v=/).map do |s|
  s = s.parse_csv
  { timestamp: s[0], url: s[1], ip: s[3] }
end
puts a

Does Ruby provide a way to do File.read() with specified encoding?

Maybe it's a BOM issue and you can solve it by opening the files in `'r:bom|utf-8'` mode. Please try that. — Patrick Oscity, Aug 19 '13 at 06:42
You should really be using CSV instead of File. That's what it's for. — pguardiario, Aug 19 '13 at 07:31
@pguardiario yes, I looked at that initially. could you please provide an example to the alternative above. thx. — pablo808, Aug 19 '13 at 07:35

7stud · Accepted Answer · 2017-01-03T22:05:08.913

> I am trying to get this solution working. I have seen people doing
>    .encode!('UTF-8', 'UTF-8', :invalid => :replace)
> but it doesn't appear to work with File.readlines.

File.readlines returns an Array. Arrays don't have an encode method. On the other hand, strings do have an encode method.
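
To make that concrete, here is a minimal sketch that cleans each line (a String) rather than the Array. It swaps in String#scrub (Ruby 2.1+) instead of the encode! call from the question, because scrub is built for exactly this job. The helper name clean_matching_lines is made up for illustration, and the sketch assumes the file is mostly valid UTF-8 with a few stray bytes. Note that File.readlines itself does not raise; the error shows up later, when the regexp is matched against a line containing invalid bytes.

```ruby
# Hypothetical helper (not part of Ruby): read lines, repair them, then filter.
def clean_matching_lines(path, pattern)
  File.readlines(path, encoding: 'UTF-8') # Array of Strings tagged as UTF-8
      .map { |line| line.scrub('?') }     # replace each invalid byte sequence with '?'
      .grep(pattern)                      # safe now: every line is valid UTF-8
end
```

Usage would look like clean_matching_lines('log.csv', /watch\?v=/), after which each line can be passed to parse_csv as in the question.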

> could you please provide an example to the alternative above.

require 'csv'

CSV.foreach("log.csv", encoding: "utf-8") do |row|
  # row[1] is the URL column (see the question), so match against it
  next unless row[1].to_s =~ /watch\?v=/
  puts row[0], row[1], row[3]
end

Or,

CSV.foreach("log.csv", 'rb:utf-8') do |row|

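If you need more speed on Ruby 1.8, use the fastercsv gem. From Ruby 1.9 on, the standard CSV library already is FasterCSV, so the gem is not needed.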

> This seems to have worked for me.
> File.readlines('log.csv', :encoding => 'ISO-8859-1')

Yes, in order to read a file you have to know its encoding.
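
As a rough sketch of what that means in practice: if you know the bytes on disk are ISO-8859-1, you can tell Ruby so and have it transcode to UTF-8 while reading, using the "external:internal" form of the encoding option. The helper name readlines_as_utf8 and its source_encoding parameter are invented for this example. Keep in mind that every byte is valid in ISO-8859-1, so this never raises; if the file is really in some other encoding, you get wrong characters instead of an error.

```ruby
# Hypothetical helper: decode the file as `source_encoding`, return UTF-8 strings.
def readlines_as_utf8(path, source_encoding = 'ISO-8859-1')
  # "EXTERNAL:INTERNAL" means: bytes on disk are EXTERNAL, strings in memory are INTERNAL.
  File.readlines(path, encoding: "#{source_encoding}:UTF-8")
end
```

The returned lines report Encoding::UTF_8 from String#encoding, so regexps and parse_csv work on them without further conversion.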

score 0 · Answer 2 · answered Mar 28 '18 at 20:34

In my case the script defaulted to US-ASCII and I wasn't at liberty to change it on the server for risk of other conflicts.

I did

File.readlines(email, :encoding => 'UTF-8').each do |line|

but this didn't work with some Japanese characters so I added this on the next line and that worked fine.

line = line.encode!('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '')
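
One thing worth checking before relying on this: with 'binary' as the source encoding, every byte above 0x7F counts as undefined, so the call removes all non-ASCII characters, including valid Japanese text, not just the broken bytes. The sketch below compares it with String#scrub (Ruby 2.1+), which drops only invalid sequences. Both helper names are made up for illustration.

```ruby
# The idiom from above: treat the bytes as binary, so every non-ASCII byte is dropped.
def strip_non_ascii(line)
  line.encode('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '')
end

# Alternative: keep valid UTF-8 characters and drop only the invalid bytes.
def strip_invalid_utf8(line)
  line.scrub('')
end

# "abc" + two Japanese characters + one stray 0xFF byte, tagged as UTF-8.
sample = ("abc\u65E5\u672C".b + "\xFF".b).force_encoding('UTF-8')

ascii_only = strip_non_ascii(sample)    # => "abc"
valid_only = strip_invalid_utf8(sample) # => "abc" followed by the two Japanese characters
```

Which one is right depends on whether the non-ASCII text matters downstream; for log filtering by URL, either may be acceptable.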
