How to avoid "Invalid byte sequence" when looking for link with text using Nokogiri

Question

I'm using Rails 5 with Ruby 2.4 and scanning a document that I parsed with Nokogiri, looking case-insensitively for a link with text:

a_elt = doc ? doc.xpath('//a').detect { |node| /link[[:space:]]+text/i === node.text } : nil

After getting the HTML of my web page in content, I parse it into a Nokogiri doc using:

doc = Nokogiri::HTML(content)

The problem is, I'm getting

ArgumentError invalid byte sequence in UTF-8

on certain web pages when using the above regular expression.

2.4.0 :002 > doc.encoding
 => "UTF-8" 
2.4.0 :003 > doc.xpath('//a').detect { |node| /individual[[:space:]]+results/i === node.text }
ArgumentError: invalid byte sequence in UTF-8
    from (irb):3:in `==='
    from (irb):3:in `block in irb_binding'
    from /Users/davea/.rvm/gems/ruby-2.4.0@global/gems/nokogiri-1.7.0/lib/nokogiri/xml/node_set.rb:187:in `block in each'
    from /Users/davea/.rvm/gems/ruby-2.4.0@global/gems/nokogiri-1.7.0/lib/nokogiri/xml/node_set.rb:186:in `upto'
    from /Users/davea/.rvm/gems/ruby-2.4.0@global/gems/nokogiri-1.7.0/lib/nokogiri/xml/node_set.rb:186:in `each'
    from (irb):3:in `detect'
    from (irb):3
    from /Users/davea/.rvm/gems/ruby-2.4.0@global/gems/railties-5.0.1/lib/rails/commands/console.rb:65:in `start'
    from /Users/davea/.rvm/gems/ruby-2.4.0@global/gems/railties-5.0.1/lib/rails/commands/console_helper.rb:9:in `start'
    from /Users/davea/.rvm/gems/ruby-2.4.0@global/gems/railties-5.0.1/lib/rails/commands/commands_tasks.rb:78:in `console'
    from /Users/davea/.rvm/gems/ruby-2.4.0@global/gems/railties-5.0.1/lib/rails/commands/commands_tasks.rb:49:in `run_command!'
    from /Users/davea/.rvm/gems/ruby-2.4.0@global/gems/railties-5.0.1/lib/rails/commands.rb:18:in `<top (required)>'
    from bin/rails:4:in `require'
    from bin/rails:4:in `<main>'

Is there a way I can rewrite the above to automatically account for the encoding or weird characters and not flip out?

Please read "[mcve]". When asking about a problem with code, we need to see the minimum code and minimum input data that demonstrate the problem. Your first line of code is questionable and implies that the code before it isn't written clearly, but, of course, without seeing it we can't help you there. In the internet wilds it's very common to find pages that were not generated correctly or cleanly, often containing characters that were entered using the keypad on Windows machines, resulting in ISO-8859-1 or Win-1252 characters being injected into the text. Convert those prior to parsing. – the Tin Man, Feb 27 '17 at 23:35

score 4 · Accepted Answer · edited May 23 '17 at 12:25

Your question may have been answered before. Have you tried the methods from "Is there any way to clean a file of 'invalid byte sequence in UTF-8' errors in Ruby?"?

Specifically, before parsing the page (and therefore before the detect block), try removing the invalid bytes and the control characters other than newlines from the raw HTML string:

content.scrub!("")
content.gsub!(/[[:cntrl:]&&[^\n\r]]/,"")

Remember, scrub! is a Ruby 2.1+ method.
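
As a rough illustration of the scrubbing idea, here is a minimal plain-Ruby sketch that needs no Nokogiri. The helper name `clean_text` and the sample string are made up for this example; the stray `\xA0` byte stands in for whatever invalid byte the real page contains.

```ruby
# Hypothetical helper (not part of Nokogiri or of the answer above):
# drop invalid bytes, then drop control characters except \n and \r.
def clean_text(str)
  str.scrub("").gsub(/[[:cntrl:]&&[^\n\r]]/, "")
end

# Assumed sample input: a string tagged as UTF-8 that contains one stray
# byte (\xA0), which is not a valid UTF-8 sequence on its own.
raw = "individual\xA0 results".b.force_encoding(Encoding::UTF_8)
pattern = /individual[[:space:]]+results/i

cleaned = clean_text(raw)

puts raw.valid_encoding?      # false
puts cleaned.valid_encoding?  # true
puts pattern.match?(cleaned)  # true
```

Under the same assumptions, the helper could be applied to the raw HTML string before it is handed to Nokogiri, or to `node.text` inside the `detect` block. Note that scrubbing discards the offending bytes rather than recovering the characters they were meant to be.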

answered Feb 23 '17 at 09:53 by ErvalhouS; edited May 23 '17 at 12:25 by Community

Scrubbing isn't the first choice. Instead, the characters are usually ISO-8859-1 or Win-1252 characters and converting them to UTF-8 will preserve them; String's [`encode`](http://ruby-doc.org/core-2.4.0/String.html#method-i-encode) method is a starting point. See http://stackoverflow.com/a/17023810/128421 – the Tin Man Feb 27 '17 at 23:51
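
The conversion approach from this comment might look roughly like the sketch below. It assumes the stray bytes really are Windows-1252, which has to be checked against the actual page; the helper name `to_utf8` and the sample bytes are invented for illustration.

```ruby
# Hypothetical helper: keep the bytes as-is when they are already valid
# UTF-8, otherwise reinterpret them as the assumed source encoding and
# transcode to UTF-8.
def to_utf8(bytes, assumed_source: "Windows-1252")
  as_utf8 = bytes.dup.force_encoding("UTF-8")
  return as_utf8 if as_utf8.valid_encoding?

  # String#encode(dst, src, **opts): bytes with no mapping in the source
  # encoding are replaced instead of raising.
  bytes.encode("UTF-8", assumed_source, invalid: :replace, undef: :replace)
end

# Assumed sample input: "café" plus curly quotes as Windows-1252 bytes.
raw = "caf\xE9 \x93individual results\x94".b

converted = to_utf8(raw)

puts converted.encoding         # UTF-8
puts converted.valid_encoding?  # true
puts(/individual[[:space:]]+results/i.match?(converted))  # true
```

Unlike scrubbing, this keeps the accented letter and the quotes as real characters. The trade-off is that a wrong guess about the source encoding produces valid but incorrect text.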
