I have the following code, which gives me an invalid byte sequence error pointing to the scan method in initialize. Any ideas on how to fix this? For what it's worth, the error does not occur when the (.*) between the h1 tag and the closing > is not there.

#!/usr/bin/env ruby

class NewsParser

  def initialize
    Dir.glob("./**/index.htm") do |file|
      @file = IO.read file
      parsed = @file.scan(/<h1(.*)>(.*?)<\/h1>(.*)<!-- InstanceEndEditable -->/im)
      self.write(parsed)
    end
  end

  def write output
    @contents = output
    open('output.txt', 'a') do |f| 
      f << @contents[0][0]+"\n\n"+@contents[0][1]+"\n\n\n\n" 
    end
  end

end

p = NewsParser.new

Edit: Here is the error message:

news_parser.rb:10:in 'scan': invalid byte sequence in UTF-8 (ArgumentError)

SOLVED: Reading the file with @file = IO.read(file).force_encoding("ISO-8859-1").encode("utf-8", replace: nil), together with the magic comment # encoding: UTF-8 at the top of the script, solved the issue.

Thanks!

redgem

2 Answers

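Two changes together solved the issue: reading the file with @file = IO.read(file).force_encoding("ISO-8859-1").encode("utf-8", replace: nil), and adding the magic comment # encoding: UTF-8 at the top of the script. The files were ISO-8859-1, not UTF-8, so force_encoding relabels the raw bytes with their real encoding and encode then transcodes them to valid UTF-8 before scan runs.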
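
To make the fix concrete, here is a minimal sketch of the same relabel-then-transcode step on an in-memory string instead of a file. The sample bytes and the helper name latin1_to_utf8 are made up for illustration and are not part of the original script. It assumes the input really is ISO-8859-1; for input in another encoding this would produce the wrong characters rather than an error.

```ruby
# Bytes for "café" as stored in an ISO-8859-1 file: the é is the single
# byte 0xE9, which is not a valid UTF-8 sequence on its own.
# Labelling these bytes as UTF-8 mimics what IO.read returns by default
# when the external encoding is UTF-8.
raw = "caf\xE9".b.force_encoding("UTF-8")

raw_is_valid = raw.valid_encoding? # false: this is the state that makes scan raise

# Hypothetical helper: relabel the bytes as ISO-8859-1 (no bytes change),
# then transcode to UTF-8 (the bytes do change).
def latin1_to_utf8(bytes)
  bytes.dup.force_encoding("ISO-8859-1").encode("UTF-8")
end

text = latin1_to_utf8(raw)

# The regexp methods now work because the string is valid UTF-8.
matches = text.scan(/caf(.)/)
```

Every byte is a valid ISO-8859-1 character, so this conversion should not raise for any input; the replace: nil option in the original line is kept there as posted and is left out of this sketch.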

redgem

While this question already has an accepted answer, I found it while having the same problem with a different style of opening the file:

File.open(file_name).each_with_index do |line, index|
  line.gsub!(/[{}]/, "'")
  puts "#{index} #{line}"
end

I found that my input file was encoded in ISO-8859-1, so I changed it to the following to avoid the error:

File.open(file_name, 'r:ISO-8859-1:utf-8').each_with_index do |line, index|
  line.gsub!(/[{}]/, "'")
  puts "#{index} #{line}"
end

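In the mode string 'r:ISO-8859-1:utf-8', the first encoding is the external one (how the bytes are stored on disk) and the second is the internal one (what Ruby transcodes each line to as it reads). See the documentation for the optional mode argument of the File.open method for more details.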
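
A small self-contained sketch of the same mode string, for anyone who wants to try it without a real ISO-8859-1 file at hand. The temporary file and its contents are invented for the example, and it uses the block form of File.open so the handle is closed automatically, which the snippets above do not do.

```ruby
require "tempfile"

converted = nil

Tempfile.create("latin1_sample") do |tmp|
  # Write raw ISO-8859-1 bytes: 0xEF is "ï" in that encoding.
  tmp.binmode
  tmp.write("na\xEFve {x}\n".b)
  tmp.close

  # External encoding ISO-8859-1, internal encoding UTF-8:
  # each line is transcoded to UTF-8 while it is being read.
  converted = File.open(tmp.path, "r:ISO-8859-1:UTF-8") do |f|
    f.each_line.map { |line| line.gsub(/[{}]/, "'") }
  end
end

converted.each_with_index do |line, index|
  puts "#{index} #{line}"
end
```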

FriendFX