I want to select some supermarket product info from this page:

http://www.angeloni.com.br/super/index?grupo=15022

To do that, I need to select the <ul> tags whose class is "lstProd ".

If the class name were "lstProd" it would be easy, but the problem is the whitespace at the end of the name. I couldn't get Jsoup to handle it.

I tried the code below, among other approaches, but it always returns an empty list.

org.jsoup.nodes.Document document = org.jsoup.Jsoup.connect("http://www.angeloni.com.br/super/index?grupo=15022").get();
org.jsoup.select.Elements list = document.select("ul.lstProd  ");

The HTML snippet from the page that I want to extract:

<ul class="lstProd  ">
    <li>
        <span class="cod">CÓD. 1341372</span>
        <span class="lnkImgProd">
            <a href="/super/produto?grupo=15022&amp;idProduto=1341372">
                <img src="http://assets.angeloni.com.br/files/images/7/1B/C6/1341372_1_V.jpg" width="120" height="120"
                     alt="Creme Dental SORRISO Super Refrescante Tubo 90g">
            </a>
        </span>
        <div class="RgtDetProd">
            <div class="boxInfoProd">
                <span class="descr">
                    <a href="/super/produto?grupo=15022&amp;idProduto=1341372">Creme Dental SORRISO Super Refrescante
                        Tubo 90g</a>

                </span>

                <ul class="lstProdFlags after">
                </ul>
            </div>
...
alexpfx

2 Answers

I think you are facing two completely separate problems:

  1. Jsoup does not load the page you think it loads. The website you specified renders its contents via JavaScript and loads some content through AJAX after the initial page load. Jsoup does not execute JavaScript, so it never sees that content. You either need to investigate the AJAX calls and request them directly with Jsoup, or use something like Selenium WebDriver to load the page in a real browser, which will render everything as you expect.

  2. For practical purposes, CSS class names can't contain spaces (see footnote 1). In HTML, whitespace separates the class names in a class attribute, so <ul class="lstProd "> is the same as <ul class="lstProd">. In a CSS selector, a class is written as .className, i.e. a dot followed by the class name, and you can chain several classes like this: element.select(".className1.className2")

1 Technically you can put spaces in CSS class names, but you need to escape them with '\ '. See https://mathiasbynens.be/notes/css-escapes or the question "Which characters are valid in CSS class names/selectors?"
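
Both points can be checked with a small sketch like the one below. It assumes Jsoup is on the classpath. The class name `LstProdCheck`, the helper `countMatches`, and the `SAMPLE` markup are made up for illustration; the markup is only modelled on the snippet in the question and is not the real page.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class LstProdCheck {

    // Hypothetical sample modelled on the question's snippet; the trailing
    // spaces in class="lstProd  " are kept on purpose.
    static final String SAMPLE =
        "<div id=\"abaProd\">"
      + "<ul class=\"lstProd  \"><li><span class=\"cod\">1341372</span>"
      + "<ul class=\"lstProdFlags after\"></ul></li></ul>"
      + "</div>";

    // Counts the elements in the given HTML that match a CSS selector.
    static int countMatches(String html, String selector) {
        Document doc = Jsoup.parse(html);
        Elements found = doc.select(selector);
        return found.size();
    }

    public static void main(String[] args) throws Exception {
        // Whitespace in the class attribute only separates class names,
        // so the plain selector finds the list.
        System.out.println(countMatches(SAMPLE, "ul.lstProd"));
        // Several classes on one element are chained without spaces.
        System.out.println(countMatches(SAMPLE, "ul.lstProdFlags.after"));

        // Optional live check (needs network access): pass any argument.
        if (args.length > 0) {
            String url = "http://www.angeloni.com.br/super/index?grupo=15022";
            String html = Jsoup.connect(url).get().outerHtml();
            System.out.println(countMatches(html, "ul.lstProd"));
        }
    }
}
```

If the two offline counts are non-zero while the live check prints 0, the list is most likely not part of the HTML the server sends, which would be consistent with point 1.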

edit: be more precise about CSS class names

luksch
  • I understand. I work with the Jaunt API, which is a simple Java web crawler, and it can deal with that page. But Jaunt is not free, so I moved to Jsoup. Do you know another API that can handle this sort of thing? I think Selenium WebDriver is a little too heavy, isn't it? – alexpfx Jan 27 '16 at 00:41
  • I was confused. Jaunt can't handle it either, for the reason you gave. I will try WebDriver. – alexpfx Jan 27 '16 at 01:21
  • WebDriver has a binding to HtmlUnit, which is a lightweight, Java-only solution. The problem is that it often does not behave as normal browsers do with complex JavaScript. It is worth a try, though. The next best option would be the PhantomJS binding, which at least is headless. – luksch Jan 27 '16 at 09:45
CSS class names CAN contain whitespace.
And <ul class="lstProd "> is NOT the same as <ul class="lstProd">.

I can also see that you have multiple <ul> elements with the same class name.
A better way to select one of them is with nth-child.
To get the list you need, you can use the selector #abaProd > ul:nth-child(4)
For more details, see the documentation on nth-child.
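
As a rough illustration of how such a positional selector behaves in Jsoup, here is a sketch against made-up markup. The class `NthChildDemo`, the helper `textOf`, and the layout of `SAMPLE` (three elements followed by two lists inside #abaProd) are assumptions; the real page may be structured differently.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

public class NthChildDemo {

    // Hypothetical layout: the first product list is the 4th child of #abaProd.
    static final String SAMPLE =
        "<div id=\"abaProd\">"
      + "<h2>title</h2><p>a</p><p>b</p>"
      + "<ul class=\"lstProd  \"><li>first list</li></ul>"
      + "<ul class=\"lstProd  \"><li>second list</li></ul>"
      + "</div>";

    // Returns the text of the first element matching the selector, or null.
    static String textOf(String html, String selector) {
        Element match = Jsoup.parse(html).select(selector).first();
        return match == null ? null : match.text();
    }

    public static void main(String[] args) {
        // :nth-child counts all element children of the parent, not only <ul>.
        System.out.println(textOf(SAMPLE, "#abaProd > ul:nth-child(4)"));
        System.out.println(textOf(SAMPLE, "#abaProd > ul:nth-child(5)"));
        // The first child is an <h2>, so this selector matches nothing.
        System.out.println(textOf(SAMPLE, "#abaProd > ul:nth-child(1)"));
    }
}
```

Because the index depends on every sibling element, a selector like this may stop matching if the page layout changes.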

Gaurav Lad
  • I think your statement about spaces in CSS names is misleading. If a space is part of the name, you MUST escape it in HTML. A simple `\ ` will do it. If you do not escape the spaces, they count as separators between class names. See https://mathiasbynens.be/notes/css-escapes or https://www.w3.org/TR/CSS21/grammar.html#scanner – luksch Jan 27 '16 at 09:39