How to scrape URL/text, when the id contains special characters using Nokogiri

Question

I'm trying to scrape some data from https://webcat.schaeffler.com/web/schaeffler/pl/PKW/applicationSearch.xhtml.

I started to build the structure of my application:

require 'nokogiri'
require 'open-uri'

class Scrape

  def first(strona)
    # URI.open is needed on Ruby 3.0+, where Kernel#open no longer accepts URLs
    @page = Nokogiri::HTML(URI.open(strona))
  end

  def marka(css)
    @page.css(css).text
  end

end

x = Scrape.new

x.first("https://webcat.schaeffler.com/web/schaeffler/pl/PKW/index.xhtml")
puts x.marka("a#searchByConstraints:form:j_idt491:0:j_idt493:0:j_idt495")

It should print "ABARTH", but the id includes special characters like ":", and the only thing I get is:

unexpected '0' after ':' (Nokogiri::CSS::SyntaxError)

I found a solution in "Is there a way to escape non-alphanumeric characters in Nokogiri css?", so I changed the last line of my code to:

puts x.marka('*[id="searchByConstraints:form:j_idt491:0:j_idt493:0:j_idt495"]')

It returns an empty string, but I don't know why.

The element on the target site looks like:

<a id="searchByConstraints:form:j_idt491:0:j_idt493:0:j_idt495" href="/web/schaeffler/pl/PKW/3854/applicationSearch.xhtml" title="ABARTH">ABARTH</a>

Either I did something wrong or it doesn't work in my case.

Please see "[ask]" and the linked pages and "[mcve](https://stackoverflow.com/help/minimal-reproducible-example)". We need the minimal HTML in your question that will demonstrate the problem, along with the minimal running code that demonstrates it. If that page or site are unavailable your question won't make sense to those searching for a similar solution in the future, which is what SO is about. — the Tin Man, Nov 27 '19 at 01:42
The page you're using doesn't contain the ID you're searching for. Try `wget -O - https://webcat.schaeffler.com/web/schaeffler/pl/PKW/index.xhtml | grep searchByConstraints` and you should get some matches. Change the grep pattern to the full string and you won't get a match. That means the page is changing dynamically so you'll have to find a different way to locate the information you're after. — the Tin Man, Nov 27 '19 at 01:59
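
The check described in the comment above can also be sketched in plain Ruby. This is only an illustration: `id_in_source?` is a hypothetical helper, and the sample markup is the element quoted in the question rather than the live page. To test the real site you would pass in the fetched body instead (for example `URI.open(url).read` with `open-uri`).

```ruby
# Hypothetical helper: reports whether the raw HTML source contains
# an id attribute with exactly the given value.
def id_in_source?(html, id)
  html.include?(%(id="#{id}"))
end

# Static sample taken from the question, so no network access is needed.
sample_html = '<a id="searchByConstraints:form:j_idt491:0:j_idt493:0:j_idt495" ' \
              'title="ABARTH">ABARTH</a>'
full_id = 'searchByConstraints:form:j_idt491:0:j_idt493:0:j_idt495'

puts id_in_source?(sample_html, full_id)
puts id_in_source?(sample_html, 'searchByConstraints:form:j_idt999')
```

If the full id is missing from the downloaded source, no selector syntax will find it, whether CSS or XPath.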

score 0 · Answer 1 · edited Nov 27 '19 at 01:40

Those ids contain colons, which have special meaning in CSS selectors, so you can't use them in a plain #id selector the way you would for a normal id such as <div id="this-is-normal">. Instead, match the element's id with an attribute selector.

I think this is what you need:

@page.css('a[id^="searchByConstraints:form:j_idt491"]')

So for you it's:

x.marka('a[id^="searchByConstraints:form:j_idt491"]')

As a side note, I would suggest naming your class Scraper rather than Scrape. It's also better practice to name the variable that holds the instance after the class, so:

scraper = Scraper.new
scraper.marka('blah') # etc. maybe you mean marker?

edited Nov 27 '19 at 01:40

the Tin Man

answered Nov 26 '19 at 04:34

lacostenycoder

A bigger problem is the CSS selector isn't right for the page. Either the page is changing dynamically or the selector is wrong. Either way the question isn't asked well. – the Tin Man Nov 27 '19 at 02:01
It gave me the same result, an empty string. But I solved it using XPath rather than CSS. – matrix-9 Nov 28 '19 at 19:52

score 0 · Accepted Answer · answered Nov 28 '19 at 20:12

I figured out how to solve it: I used XPath instead of CSS.

I changed this code:

def marka(css)
  @page.css(css).text
end

puts x.marka("a#searchByConstraints:form:j_idt491:0:j_idt493:0:j_idt495")

To this:

def marka(xpath)
  # The argument is now an XPath expression, not a CSS selector
  @page.xpath(xpath).text
end

puts x.marka("//*[@id='searchByConstraints:form:j_idt491:0:j_idt493:0:j_idt495']")
