I am processing a file that contains data from the web, and on certain log files I get an invalid byte sequence in UTF-8 (ArgumentError) error.
require 'csv'

a = File.readlines('log.csv').grep(/watch\?v=/).map do |s|
  # extract three columns: time stamp, url, ip
  s = s.parse_csv
  { timestamp: s[0], url: s[1], ip: s[3] }
end
puts a
I am trying to get this solution working. I have seen people use
.encode!('UTF-8', 'UTF-8', :invalid => :replace)
but it doesn't appear to work with File.readlines:
File.readlines('log.csv').encode!('UTF-8', 'UTF-8', :invalid => :replace).grep(/watch\?v=/)
' : undefined method `encode!' for # (NoMethodError)
What's the most straightforward way to filter or convert invalid UTF-8 characters while reading a file?
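One thing worth noting about the NoMethodError above: File.readlines returns an Array, so a String method has to be applied to each line rather than to the result as a whole. A minimal sketch of that idea follows, assuming Ruby 2.1 or later, where String#scrub is available. The helper name read_scrubbed_lines, the temporary demo file, and the "?" replacement character are illustrative assumptions, not part of the original code.

```ruby
require 'csv'
require 'tempfile'

# Hypothetical helper: read each line as UTF-8 and replace any invalid
# byte sequence with "?" so later regex matching does not raise.
def read_scrubbed_lines(path)
  File.foreach(path, encoding: 'UTF-8').map { |line| line.scrub('?') }
end

# Demo input (assumed data): the second row contains the byte \xE9,
# which is not valid UTF-8 on its own.
sample = Tempfile.new(['log', '.csv'])
sample.close
File.binwrite(
  sample.path,
  "t1,http://example.com/watch?v=abc,x,1.2.3.4\n".b +
  "t2,http://example.com/watch?v=caf\xE9,x,5.6.7.8\n".b
)

rows = read_scrubbed_lines(sample.path).grep(/watch\?v=/).map do |line|
  fields = line.parse_csv
  { timestamp: fields[0], url: fields[1], ip: fields[3] }
end
puts rows
```

Scrubbing loses the original byte, so this only suits cases where the damaged characters do not matter for the columns being extracted.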
Attempt 1
I tried this, but it failed with the same invalid byte sequence error.
IO.foreach('test.csv', 'r:bom|UTF-8').grep(/watch\?v=/).map do |s|
  # extract three columns: time stamp, url, ip
  s = s.parse_csv
  { timestamp: s[0], url: s[1], ip: s[3] }
end
Solution
This seems to have worked for me.
a = File.readlines('log.csv', :encoding => 'ISO-8859-1').grep(/watch\?v=/).map do |s|
  s = s.parse_csv
  { timestamp: s[0], url: s[1], ip: s[3] }
end
puts a
Does Ruby provide a way to call File.read() with a specified encoding?
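As far as I can tell, File.read accepts the same encoding: option as File.readlines. The sketch below shows two variants under that assumption: tagging the bytes as ISO-8859-1, and reading them as ISO-8859-1 while transcoding to UTF-8 with the "external:internal" form. The temporary file and its contents are made up for the demonstration.

```ruby
require 'tempfile'

# Demo file (assumed data): "café" stored as ISO-8859-1, where \xE9 is "é".
sample = Tempfile.new(['latin1', '.txt'])
sample.close
File.binwrite(sample.path, "caf\xE9\n".b)

# Variant 1: label the bytes as ISO-8859-1 without converting them.
tagged = File.read(sample.path, encoding: 'ISO-8859-1')

# Variant 2: read as ISO-8859-1 and transcode to UTF-8 on the way in.
converted = File.read(sample.path, encoding: 'ISO-8859-1:UTF-8')

puts tagged.encoding
puts converted.encoding
puts converted.valid_encoding?
```

Every byte value maps to a character in ISO-8859-1, which would explain why reading with that encoding never raises the invalid byte sequence error. If the file is really in some other encoding, non-ASCII characters will come out wrong even though nothing fails.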