
I'm using Rails 5 with Ruby 2.4 and scanning a document that I parsed with Nokogiri, looking case-insensitively for a link with specific text:

a_elt = doc ? doc.xpath('//a').detect { |node| /link[[:space:]]+text/i === node.text } : nil 

After getting the HTML of my web page in content, I parse it into a Nokogiri doc using:

doc = Nokogiri::HTML(content) 

The problem is, I'm getting

ArgumentError invalid byte sequence in UTF-8

on certain web pages when using the above regular expression.

2.4.0 :002 > doc.encoding
 => "UTF-8" 
2.4.0 :003 > doc.xpath('//a').detect { |node| /individual[[:space:]]+results/i === node.text }
ArgumentError: invalid byte sequence in UTF-8
    from (irb):3:in `==='
    from (irb):3:in `block in irb_binding'
    from /Users/davea/.rvm/gems/ruby-2.4.0@global/gems/nokogiri-1.7.0/lib/nokogiri/xml/node_set.rb:187:in `block in each'
    from /Users/davea/.rvm/gems/ruby-2.4.0@global/gems/nokogiri-1.7.0/lib/nokogiri/xml/node_set.rb:186:in `upto'
    from /Users/davea/.rvm/gems/ruby-2.4.0@global/gems/nokogiri-1.7.0/lib/nokogiri/xml/node_set.rb:186:in `each'
    from (irb):3:in `detect'
    from (irb):3
    from /Users/davea/.rvm/gems/ruby-2.4.0@global/gems/railties-5.0.1/lib/rails/commands/console.rb:65:in `start'
    from /Users/davea/.rvm/gems/ruby-2.4.0@global/gems/railties-5.0.1/lib/rails/commands/console_helper.rb:9:in `start'
    from /Users/davea/.rvm/gems/ruby-2.4.0@global/gems/railties-5.0.1/lib/rails/commands/commands_tasks.rb:78:in `console'
    from /Users/davea/.rvm/gems/ruby-2.4.0@global/gems/railties-5.0.1/lib/rails/commands/commands_tasks.rb:49:in `run_command!'
    from /Users/davea/.rvm/gems/ruby-2.4.0@global/gems/railties-5.0.1/lib/rails/commands.rb:18:in `<top (required)>'
    from bin/rails:4:in `require'
    from bin/rails:4:in `<main>' 

Is there a way I can rewrite the above to automatically account for the encoding or weird characters and not flip out?

the Tin Man
Dave
  • Please read "[mcve]". When asking about a problem with the code we need to see the minimum code and minimum input data that demonstrate the problem. Your first line of code is questionable and implies that the code before it isn't written clearly, but, of course, without seeing it we can't help you there. In the internet wilds it's very common to find pages that were not generated correctly or cleanly, often containing characters that were entered using the keypad on Windows machines, resulting in ISO-8859-1 or Win-1252 characters being injected into the text. Convert those prior to parsing. – the Tin Man Feb 27 '17 at 23:35

1 Answer


Your question may have already been answered before. Have you tried the methods from "Is there any way to clean a file of "invalid byte sequence in UTF-8" errors in Ruby?"?

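Specifically, before parsing, try removing the invalid bytes and the control characters (except newlines and carriage returns) from the raw content string. These are String methods, so call them on content rather than on the Nokogiri document:

content.scrub!("")
content.gsub!(/[[:cntrl:]&&[^\n\r]]/,"")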

Remember, scrub! is a Ruby 2.1+ method.
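
As a minimal sketch of that cleanup on a plain Ruby string (no Nokogiri involved): the helper name clean_utf8 and the sample bytes are made up for illustration, and it assumes the fetched HTML is meant to be UTF-8.

```ruby
# Hypothetical helper (the name is illustrative, not from the question):
# returns a copy of +text+ that is valid UTF-8 and free of control
# characters other than \n and \r.
def clean_utf8(text)
  text.dup
      .force_encoding(Encoding::UTF_8)    # assumption: the page is meant to be UTF-8
      .scrub("")                          # drop invalid byte sequences (Ruby 2.1+)
      .gsub(/[[:cntrl:]&&[^\n\r]]/, "")   # drop control characters except newlines
end

# Made-up sample: one invalid byte (\xFF) and one NUL control character.
raw = "Individual\xFF   Results\x00".b
cleaned = clean_utf8(raw)

puts cleaned.valid_encoding?
puts(/individual[[:space:]]+results/i === cleaned)
```

Note that scrubbing silently discards the bad bytes, so any characters they were meant to represent are lost.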

ErvalhouS
  • Scrubbing isn't the first choice. Instead, the characters are usually ISO-8859-1 or Win-1252 characters and converting them to UTF-8 will preserve them; String's [`encode`](http://ruby-doc.org/core-2.4.0/String.html#method-i-encode) method is a starting point. See http://stackoverflow.com/a/17023810/128421 – the Tin Man Feb 27 '17 at 23:51
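
A rough sketch of the conversion approach from that comment, using String#encode. The helper name to_utf8 and the sample bytes are invented for illustration, and treating non-UTF-8 input as Windows-1252 is an assumption that may not hold for a given page.

```ruby
# Hypothetical helper: if the bytes are already valid UTF-8, keep them;
# otherwise assume they are Windows-1252 (an assumption about the source
# page) and transcode them to UTF-8.
def to_utf8(bytes, fallback: Encoding::Windows_1252)
  utf8 = bytes.dup.force_encoding(Encoding::UTF_8)
  return utf8 if utf8.valid_encoding?

  bytes.dup
       .force_encoding(fallback)
       .encode(Encoding::UTF_8, invalid: :replace, undef: :replace)
end

# Made-up sample: \xE9 is "é" in Windows-1252 but is invalid on its own in UTF-8.
raw = "caf\xE9 link   text".b
converted = to_utf8(raw)

puts converted.valid_encoding?
puts(/link[[:space:]]+text/i === converted)
```

Unlike scrubbing, this keeps the original characters, but only if the guessed fallback encoding matches what the page actually used.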