I've been working with Nokogiri for a couple of days and I absolutely adore it. Everything was working brilliantly until I got a requirement to scrape a website that uses the data-reactid attribute. The problem is that Nokogiri seems to get confused by the format of the attribute values this website uses (several periods, some dollar signs, and other characters that are invalid in XML/CSS identifiers).

An example of what I need to scrape would be:

<td data-reactid=".3.3.1:$contract_23.$=1$dataRow:0.1">94.280</td>

I need the value (94.280) inside the element whose data-reactid attribute is ".3.3.1:$contract_23.$=1$dataRow:0.1",

which in Nokogiri we would usually select by doing something like:

doc.css("type[attributename=attributeid]")

In my example it would be:

doc.css("td[data-reactid=.3.3.1:$contract_23.$=1$dataRow:0.1]")

But no matter what I do to escape the invalid characters, it keeps telling me there is an invalid character after my equals sign.

Error message for code above:

nokogiri-1.4.3.1/lib/nokogiri/css/parser.rb:78:in `on_error': unexpected '.3' after 'equal'

I've tried:

a) Getting my string defined as a variable and forced into a string

b) Escaping it with backslashes (.3.[...])

c) Prefixing it with a hash (#.3.3[...])

d) Escaping it using cgi escapedString

e) Placing it inside '%{ }' eg '%{.3.3[...]}'

No matter what I do, I keep getting the same message (except for option e, which gives me an altogether different error message):

: no .<digit> floating literal anymore; put 0 before dot

Can you guys help me get the right value with such an oddly-named attribute?

jordanhill123
Antonio

1 Answer

You didn't show how you are parsing your document, but if I parse it as HTML and then put single quotes around the attribute value in the CSS selector, I can get the tag:

require 'nokogiri'

html = <<END_OF_HTML
<td data-reactid="hello">10</td>
<td data-reactid=".3.3.1:$contract_23.$=1$dataRow:0.1">94.280</td>
<td data-reactid="goodbye">20</td>
END_OF_HTML

html_doc = Nokogiri::HTML(html)

html_doc.css("td[data-reactid='.3.3.1:$contract_23.$=1$dataRow:0.1']").each do |tag|
  puts tag.text
end


--output:--
94.280

Check out the Mothereffing Unquoted Attribute Value Validator via this SO post:

CSS attribute selectors: The rules on quotes (", ' or none?)
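
As a rough illustration of the rule discussed in that post, here is a sketch of a check for whether a value could be left unquoted, plus a helper that simply always quotes. Both attr_selector and needs_quotes? are hypothetical helpers, not part of Nokogiri, and the regular expression is a simplified approximation of a CSS identifier (it ignores escapes and non-ASCII characters).

```ruby
# Simplified approximation of a CSS identifier: an optional leading hyphen,
# then a letter or underscore, then letters, digits, underscores or hyphens.
SIMPLE_IDENT = /\A-?[A-Za-z_][A-Za-z0-9_-]*\z/

# Hypothetical helper: true when the value is not a plain identifier
# and therefore has to be quoted inside an attribute selector.
def needs_quotes?(value)
  !value.match?(SIMPLE_IDENT)
end

# Hypothetical helper: build tag[attribute="value"], always quoting the value
# and backslash-escaping any double quotes or backslashes it contains.
def attr_selector(tag, attribute, value)
  escaped = value.gsub(/["\\]/) { |ch| "\\#{ch}" }
  %(#{tag}[#{attribute}="#{escaped}"])
end

react_id = '.3.3.1:$contract_23.$=1$dataRow:0.1'

puts needs_quotes?('hello')
puts needs_quotes?(react_id)
puts attr_selector('td', 'data-reactid', react_id)
```

Always quoting is the simpler policy: a quoted value is valid whether or not it happens to be a plain identifier.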

7stud
  • Wow that was fast! Thanks a lot! I am using HTML as my parser and your method has indeed worked! The magic of single quotes inside of double ones! This works a treat! – Antonio Oct 02 '14 at 00:01
  • @Antonio, You're welcome. Note that double quotes are used when you want to interpolate something into the string, e.g. `planet = "earth"; puts "hello #{planet}"`. Because you don't need to interpolate anything into your CSS selector, it would actually make more sense to use single quotes on the outside and double quotes on the inside: `'td[data-reactid=".3.3.1:$contract_23.$=1$dataRow:0.1"]'`. The double quotes on the inside also have a certain matching symmetry with the double quotes used in the HTML. – 7stud Oct 02 '14 at 00:18
  • @Antonio, As for `%q{}` and `%Q{}`, they can't be used inside a string, but you could have done this: `%q{td[data-reactid=".3.3.1:$contract_23.$=1$dataRow:0.1"]}`. But there is no reason to use `%q{}` (or `%Q{}`) there when it's clearer to use single quotes. – 7stud Oct 02 '14 at 00:20
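
To make the quoting options from these comments concrete, here is a small sketch comparing the string literal forms. The variable names (single, double, percent, react_id, interpolated) are made up for illustration.

```ruby
# Single quotes outside, double quotes inside: no interpolation, no escaping.
single = 'td[data-reactid=".3.3.1:$contract_23.$=1$dataRow:0.1"]'

# Double quotes outside, single quotes inside: the form used in the answer.
double = "td[data-reactid='.3.3.1:$contract_23.$=1$dataRow:0.1']"

# %q{} behaves like single quotes, so it yields the same characters as `single`.
percent = %q{td[data-reactid=".3.3.1:$contract_23.$=1$dataRow:0.1"]}

# Interpolation is the case where a double-quoted (or %Q{}) literal is needed.
react_id = '.3.3.1:$contract_23.$=1$dataRow:0.1'
interpolated = "td[data-reactid=\"#{react_id}\"]"

puts single == percent
puts single == interpolated
puts double
```

The first two comparisons are true: the literal form only changes how the string is written in the Ruby source, not the selector that reaches the CSS parser.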