117

I'm writing a crawler in Ruby (1.9) that consumes lots of HTML from a lot of random sites.
When trying to extract links, I decided to just use `.scan(/href="(.*?)"/i)` instead of Nokogiri/Hpricot (major speedup). The problem is that I now receive a lot of "invalid byte sequence in UTF-8" errors.
From what I understood, the net/http library doesn't have any encoding-specific options, and the stuff that comes in is basically not properly tagged.
What would be the best way to actually work with that incoming data? I tried `.encode` with the `replace` and `invalid` options set, but no success so far...

Marc Seeger
  • something that might break characters, but keeps the string valid for other libraries: `valid_string = untrusted_string.unpack('C*').pack('U*')` – Marc Seeger Aug 06 '11 at 07:17
  • Having the exact issue, tried the same other solutions. No love. Tried Marc's, but it seems to garble everything. Are you sure `'U*'` undoes `'C*'`? – Jordan Warbelow-Feldstein Oct 24 '11 at 03:05
  • No, it does not :) I just used that in a webcrawler where I care about 3rd party libraries not crashing more than I do about a sentence here and there. – Marc Seeger Nov 29 '12 at 09:48

12 Answers

176

In Ruby 1.9.3 it is possible to use String#encode to "ignore" the invalid UTF-8 sequences. Here is a snippet that will work both in 1.8 (Iconv) and 1.9 (String#encode):

require 'iconv' unless String.method_defined?(:encode)
if String.method_defined?(:encode)
  file_contents.encode!('UTF-8', 'UTF-8', :invalid => :replace)
else
  ic = Iconv.new('UTF-8', 'UTF-8//IGNORE')
  file_contents = ic.iconv(file_contents)
end

or if you have really troublesome input you can do a double conversion from UTF-8 to UTF-16 and back to UTF-8:

require 'iconv' unless String.method_defined?(:encode)
if String.method_defined?(:encode)
  file_contents.encode!('UTF-16', 'UTF-8', :invalid => :replace, :replace => '')
  file_contents.encode!('UTF-8', 'UTF-16')
else
  ic = Iconv.new('UTF-8', 'UTF-8//IGNORE')
  file_contents = ic.iconv(file_contents)
end
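
To see the double conversion in action on a concrete string (my own example, not from the original answer): the byte `\xFF` can never appear in valid UTF-8, so a round trip through UTF-16 strips it while keeping the valid `é`:

```ruby
# "\xFF" is never a valid UTF-8 byte; "\xC3\xA9" is a valid UTF-8 "é"
bad = "caf\xFF\xC3\xA9".force_encoding('UTF-8')
bad.valid_encoding?   # => false

# Round-tripping through UTF-16 forces a real conversion,
# so :invalid => :replace actually runs and drops the bad byte
clean = bad.encode('UTF-16', invalid: :replace, replace: '').encode('UTF-8')
clean                 # => "café"
clean.valid_encoding? # => true
```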
Mark Swardstrom
RubenLaguna
  • I've compared with my solution and found that mine loses some letters, at least `ё`: `"Alena V.\"`. While your solution keeps it: `"Ale\u0308na V.\"`. Nice. – Nakilon Jan 16 '12 at 01:20
  • With some problematic input I also use a double conversion from UTF-8 to UTF-16 and then back to UTF-8: `file_contents.encode!('UTF-16', 'UTF-8', :invalid => :replace, :replace => '')` `file_contents.encode!('UTF-8', 'UTF-16')` – RubenLaguna Jan 16 '12 at 09:28
  • There is also the option of `force_encoding`. If you have read an ISO8859-1 string as UTF-8 (and thus that string contains invalid UTF-8), then you can "reinterpret" it as ISO8859-1 with `the_string.force_encoding("ISO8859-1")` and just work with that string in its real encoding. – RubenLaguna Feb 20 '12 at 14:36
  • That double encode trick just saved my bacon! I wonder why it is required though? – johnf Mar 12 '12 at 02:32
  • I'm using this on my mysql database of Apple's affiliate feed for app store data. The double encode works! But the formatting on the app descriptions is messed up now :/ – nnyby May 05 '12 at 00:03
  • Where should I put those lines? – Lefsler Aug 30 '12 at 11:38
  • I think the double conversion works because it forces an encoding conversion (and with it the check for invalid characters). If the source string is already encoded in UTF-8, then just calling `.encode('UTF-8')` is a no-op, and no checks are run. [Ruby Core Documentation for encode](http://www.ruby-doc.org/core-1.9.3/String.html#method-i-encode). However, converting it to UTF-16 first forces all the checks for invalid byte sequences to be run, and replacements are done as needed. – Jo Hund Aug 11 '13 at 18:32
  • If you want an example string for which the double conversion is required, here's one I have `URI.decode("%E2%EF%BF%BD%A6-invalid")`. – gtd Feb 26 '14 at 15:51
83

Neither the accepted answer nor the other answer worked for me. I found this post, which suggested:

string.encode!('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '')

This fixed the problem for me.
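
As an illustration of what that line does (my own sketch, not from the original answer): passing `'binary'` (an alias for ASCII-8BIT) as the source encoding makes every byte above 0x7F an "undefined" conversion, so `undef: :replace` strips it. That removes the invalid sequences, but note the caveat at the end:

```ruby
s = "abc\xE2\x9C"               # ends with a truncated UTF-8 sequence
s = s.force_encoding('UTF-8')
s.valid_encoding?               # => false

clean = s.encode('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '')
clean                           # => "abc"

# Caveat: valid multibyte characters are stripped too, since their bytes
# are also "undefined" when read as binary:
"é".encode('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '')  # => ""
```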

Amir Raminfar
  • This fixed the problem for me, and I like using non-deprecated methods (I have Ruby 2.0 now). – La-comadreja Apr 26 '14 at 19:51
  • This one is the only one that works! I have tried all of the above solutions; none of them worked. String used in testing: `"fdsfdsf dfsf sfds fs sdf hello fooo??? {!@#$%^&*()_+} \xEF\xBF\xBD \xef\xbf\x9c \xc2\x90 \xc2\x90"` – Chihung Yu Jan 07 '16 at 21:47
  • What is the second argument `'binary'` for? – Henley Dec 18 '18 at 03:04
25

My current solution is to run:

my_string.unpack("C*").pack("U*")

This will at least get rid of the exceptions, which was my main problem.
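
To illustrate (my own example): `unpack('C*')` reads the raw bytes and `pack('U*')` re-encodes each byte as its own Unicode codepoint, which effectively treats the input as Latin-1. That always yields a valid string, but, as the comments above note, it garbles any sequence that was already valid multibyte UTF-8:

```ruby
# A lone 0xE9 is invalid UTF-8 but is 'é' in Latin-1 - it gets recovered
"caf\xE9".unpack('C*').pack('U*')   # => "café"

# An already-valid UTF-8 'é' (0xC3 0xA9) is split into two codepoints
"é".unpack('C*').pack('U*')         # => "Ã©"
```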

Marc Seeger
  • I'm using this method in combination with `valid_encoding?`, which seems to detect when something is wrong: `val.unpack('C*').pack('U*') if !val.valid_encoding?`. – Aaron Gibralter Jan 19 '12 at 16:41
  • This one worked for me. Successfully converts my `\xB0` back to degrees symbols. Even the `valid_encoding?` comes back true but I still check if it doesn't and strip out the offending characters using Amir's answer above: `string.encode!('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '')`. I had also tried the `force_encoding` route but that failed. – hamstar Aug 04 '14 at 23:48
  • This is great. Thanks. – d_ethier Dec 17 '15 at 03:58
8

Try this:

def to_utf8(str)
  str = str.force_encoding('UTF-8')
  return str if str.valid_encoding?
  str.encode("UTF-8", 'binary', invalid: :replace, undef: :replace, replace: '')
end
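
A quick check of the helper on both valid and invalid input (definition repeated so the snippet is self-contained):

```ruby
def to_utf8(str)
  str = str.force_encoding('UTF-8')
  return str if str.valid_encoding?
  str.encode('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '')
end

to_utf8("already fine")       # => "already fine"
to_utf8("bad byte \xFF here") # => "bad byte  here"
```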
Ranjithkumar Ravi
4
attachment = file.read

begin
  # Try it as UTF-8 directly
  cleaned = attachment.dup.force_encoding('UTF-8')
  unless cleaned.valid_encoding?
    # Some of it might be old Windows code page
    cleaned = attachment.encode('UTF-8', 'Windows-1252')
  end
  attachment = cleaned
rescue EncodingError
  # Force it to UTF-8, throwing out invalid bits
  attachment = attachment.force_encoding("ISO-8859-1").encode("utf-8", replace: nil)
end
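
For example (a sketch with my own sample input, assuming the data really is Windows-1252): the byte `0x92` is a curly apostrophe in Windows-1252 but invalid on its own in UTF-8, so the fallback branch recovers it instead of discarding it:

```ruby
attachment = "it\x92s here"                 # 0x92 = ’ in Windows-1252
cleaned = attachment.dup.force_encoding('UTF-8')
unless cleaned.valid_encoding?
  # Not valid UTF-8 - reinterpret the same bytes as Windows-1252
  cleaned = attachment.encode('UTF-8', 'Windows-1252')
end
cleaned   # => "it’s here"
```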
rusllonrails
4

I recommend using an HTML parser. Just find the fastest one.

Parsing HTML is not as easy as it may seem.

Browsers handle invalid UTF-8 sequences in UTF-8 HTML documents by substituting the "�" symbol, so once the invalid UTF-8 sequence in the HTML gets parsed, the resulting text is a valid string.

Even inside attribute values you have to decode HTML entities like `&amp;`.

Here is a great question that sums up why you cannot reliably parse HTML with a regular expression: RegEx match open tags except XHTML self-contained tags

Eduardo
  • I'd love to keep the regexp since it's about 10 times faster, and I really don't want to parse the HTML correctly but just want to extract links. I should be able to replace the invalid parts in Ruby by just doing `ok_string = bad_string.encode("UTF-8", {:invalid => :replace, :undef => :replace})`, but that doesn't seem to work :( – Marc Seeger Jun 06 '10 at 11:02
3

This seems to work:

def sanitize_utf8(string)
  return nil if string.nil?
  return string if string.valid_encoding?
  string.chars.select { |c| c.valid_encoding? }.join
end
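
Usage (definition repeated so the snippet runs on its own); note that this walks the string character by character, so it will be slower than `encode` or `scrub` on large input:

```ruby
def sanitize_utf8(string)
  return nil if string.nil?
  return string if string.valid_encoding?
  string.chars.select { |c| c.valid_encoding? }.join
end

# force_encoding is needed because the literal's bytes are tagged as binary
sanitize_utf8("ok \xFF here".force_encoding('UTF-8'))  # => "ok  here"
sanitize_utf8(nil)                                     # => nil
```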
Spajus
2

I've encountered strings that had a mix of English, Russian, and some other alphabets, which caused an exception. I need only Russian and English, and this currently works for me:

ec1 = Encoding::Converter.new("UTF-8", "Windows-1251", :invalid => :replace, :undef => :replace, :replace => "")
ec2 = Encoding::Converter.new("Windows-1251", "UTF-8", :invalid => :replace, :undef => :replace, :replace => "")
t = ec2.convert(ec1.convert(t))
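
As an illustration (the sample string is my own): the round trip keeps anything representable in Windows-1251 (Cyrillic plus ASCII) and silently drops everything else:

```ruby
ec1 = Encoding::Converter.new('UTF-8', 'Windows-1251',
                              invalid: :replace, undef: :replace, replace: '')
ec2 = Encoding::Converter.new('Windows-1251', 'UTF-8',
                              invalid: :replace, undef: :replace, replace: '')

# Chinese characters have no Windows-1251 representation, so they vanish
t = ec2.convert(ec1.convert("Привет, world, 你好"))
t   # => "Привет, world, "
```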
Nakilon
1

While Nakilon's solution works, at least as far as getting past the error, in my case I had a weird messed-up character originating from Microsoft Excel converted to CSV that was registering in Ruby as a (get this) Cyrillic K, which in Ruby was a bolded K. To fix this I used 'iso-8859-1', viz. `CSV.parse(f, :encoding => "iso-8859-1")`, which turned my freaky-deaky Cyrillic Ks into a much more manageable `/\xCA/`, which I could then remove with `string.gsub!(/\xCA/, '')`.

boulder_ruby
  • Again, I just want to note that while Nakilon's (and others) fix was for Cyrillic characters originating from (haha) Cyrillia, this output is standard output for a csv which was converted from xls! – boulder_ruby Oct 16 '12 at 03:57
0

Before you use `scan`, make sure that the requested page's `Content-Type` header is `text/html`, since there can be links to things like images which are not encoded in UTF-8. The page could also be non-HTML if you picked up an `href` in something like a `<link>` element. How to check this varies depending on which HTTP library you are using. Then, make sure the result is only ASCII with `String#ascii_only?` (not UTF-8, because HTML is only supposed to be using ASCII; entities can be used otherwise). If both of those tests pass, it is safe to use `scan`.

Adrian
  • thanks, but that's not my problem :) I only extract the host part of the URL anyway and hit only the front page. My problem is that my input apparently isn't UTF-8 and the 1.9 encoding foo goes haywire – Marc Seeger Jun 06 '10 at 00:57
  • @Marc Seeger: What do you mean by "my input"? Stdin, the URL, or the page body? – Adrian Jun 06 '10 at 01:14
  • HTML can be encoded in UTF-8: http://en.wikipedia.org/wiki/Character_encodings_in_HTML – Eduardo Jun 06 '10 at 01:39
  • my input = the page body @Eduardo: I know. My problem is that the data coming from net/http seems to have a bad encoding from time to time – Marc Seeger Jun 06 '10 at 11:00
  • It's not uncommon for webpages to actually have bad encoding for real. The response header might say it's one encoding but then actually serving another encoding. – sunkencity Jan 12 '12 at 06:46
0

There is also the `scrub` method to filter invalid bytes.

string.scrub('')
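
`String#scrub` is available from Ruby 2.1 on. With no argument it substitutes U+FFFD ("�"); with an argument it uses that replacement instead (example string is my own):

```ruby
bad = "abc\xFFdef".force_encoding('UTF-8')

bad.scrub       # => "abc�def"
bad.scrub('')   # => "abcdef"
```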
rtrrtr
-1

If you don't "care" about the data you can just do something like:

search_params = params[:search].valid_encoding? ? params[:search].gsub(/\W+/, '') : "nothing"

I just used `valid_encoding?` to get past it. Mine is a search field, and so I was finding the same weirdness over and over, so I used something like the above just to have the system not break. Since I don't control the user experience to auto-validate prior to sending this info (like auto feedback to say "dummy up!"), I can just take it in, strip it out, and return blank results.

pjammer