I'm writing a spider with Scrapy, and some elements on the page are hidden by CSS rules. I want to select only the visible ones.

XPath can filter out elements whose style is written inline, such as <span style="display:none">, but not elements hidden by a CSS rule in a stylesheet, such as <style>.pigf{display:none}</style>.

It seems I need to render the CSS in order to filter out the invisible elements correctly, but how can I do that? Is there a simpler solution?

Example html:

<span>
    <style>
        .pigf{display:none}.n8T-{display:inline}.pGrH{display:none}.XUYD{display:inline}.jdKj{display:none}.r7fk{display:inline}.pkO2{display:none}.EzIC{display:inline}
    </style>
    <span class="55">
        27
    </span>
    <div style="display:none">
        36
    </div>
    <span style="display:none">
        174
    </span>
    <span class="pkO2">
        174
    </span>
    <span>
    </span>
    .
    <span style="display:none">
        10
    </span>
    <span class="pkO2">
        10
    </span>
    <div style="display:none">
        10
    </div>
    <span style="display:none">
        49
    </span>
    <span class="jdKj">
        49
    </span>
    <span style="display:none">
        84
    </span>
    <span>
    </span>
    <span class="n8T-">
        115
    </span>
    <span style="display:none">
        129
    </span>
    <div style="display:none">
        129
    </div>
    <div style="display:none">
        143
    </div>
    <span style="display:none">
        151
    </span>
    <div style="display:none">
        169
    </div>
    <span>
    </span>
    .
    <span class="14">
        75
    </span>
    <span class="XUYD">
        .
    </span>
    <div style="display:none">
        23
    </div>
    <span style="display:none">
        79
    </span>
    <span style="display: inline">
        114
    </span>
</span>
alecxe
damn_c
There is no direct solution with `scrapy`, but you could use selenium to properly render the page (like a browser); with scrapy you just need to set up the correct XPaths. – eLRuLL Nov 30 '15 at 15:54

1 Answer

To make this reliable, you need something that renders the HTML, ideally a real browser. Look into the selenium package, which lets you automate browsers. Note that the browser can also be headless, like PhantomJS.

selenium can easily distinguish visible from invisible elements: there is an is_displayed() method you can use to check visibility. Also, when you get the text of an element, according to the specification, it returns only the visible part of the text.
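As a rough illustration of the is_displayed() check, here is a minimal sketch. The helper name `visible_texts`, the `demo` function, the CSS selector, and the choice of headless Chrome are all assumptions made for the example; they are not taken from the question, and any browser supported by selenium should work the same way.

```python
def visible_texts(elements):
    """Return the stripped text of every element reported as displayed."""
    texts = []
    for element in elements:
        # is_displayed() is evaluated by the browser, so it also catches
        # elements hidden by a <style> rule such as .pigf{display:none}.
        if element.is_displayed():
            text = element.text.strip()
            if text:
                texts.append(text)
    return texts


def demo(url):
    """Hypothetical driver code: load `url` and collect the visible pieces."""
    # Imported here so the helper above can be used without a browser.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")  # assumption: a recent Chrome
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Assumed selector: direct children of the outer <span> in the sample.
        children = driver.find_elements(By.CSS_SELECTOR, "span > span, span > div")
        return visible_texts(children)
    finally:
        driver.quit()
```

Calling `demo(...)` with the page URL would return the text of the visible child elements. Bare text nodes (the "." separators in the sample) are not elements, so if they are needed, reading `.text` on the outer span instead is probably the simpler route.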

You may also get away with rendering your page in Splash with the help of the scrapy-splash middleware. Example usage can be found here.

alecxe