scrapy selector return null when parsing url but ok when parsing saved url

Question

I'm trying to scrape a data table from the web using a Scrapy selector, but I get an empty array. The odd thing is that when I save the page to a file and scrape that file, I get the expected (non-empty) array. The Scrapy version, the selector, and the expected response are listed below.

Scrapy Version

Scrapy  : 0.18.2
lxml    : 3.2.3.0
libxml2 : 2.9.0
Twisted : 13.1.0
Python  : 2.7.5 (default, May 15 2013, 22:44:16) [MSC v.1500 64 bit (AMD64)]
Platform: Windows-8-6.2.9200

selector

hxs.select('//table[contains(@class,"mnytbl")]//tbody//td[contains(@headers,"tbl\x34\x37a")]//span/text()').extract()

Expected Response

[u'\n1.26 Bil\n        \n', u'\n893.90 Mil\n        \n',
 u'\n924.87 Mil\n        \n', u'\n1.18 Bil\n        \n',
 u'\n1.55 Bil\n        \n', u'\n2.91 Bil\n        \n',
 u'\n3.96 Bil\n        \n', u'\n4.01 Bil\n        \n',
 u'\n3.35 Bil\n        \n', u'\n2.36 Bil\n        \n']

<url>: http://investing.money.msn.com/investments/financial-statements?symbol=SPF

Shell Command to connect to the web

$ scrapy shell <url>

Running the selector on the live response returns an empty array ([]). If I save the HTML output to a file (e.g. C:\src.html) and run the same selector on it, I get the expected response.

Thx!

score 2 · Answer 1 · edited May 23 '17 at 12:11

I understand you want the cells from the second column, the one with the header "SALES".

I don't really know where your contains(@headers,"tbl\x34\x37a") predicate comes from; I think the problem may be that the "headers" attributes of the td cells are generated dynamically.

I suggest you try this rather scary XPath expression:

    //div[div[contains(span, "INCOME STATEMENT")]]
        //table[contains(@class,"mnytbl")]/tbody/tr
           /td[
               position() = (
                       count(../../../thead/tr/th[contains(., "SALES")]
                                        /preceding-sibling::th)
                       + 1
                   )
               ]

This borrows from "Find position of a node using xpath" to determine the position of an element.

Explanations:

first, find the first table: the one within a div that contains a div that contains a span with "INCOME STATEMENT"...
then, find the td cells whose position() is the same as the position of their cousin th cell with the value "SALES"
../../.. goes from the td back up to its great-grandparent table; this can be simplified to ancestor::table[1] (the nearest table ancestor)

So, to get the text nodes inside the span of the second cell in every row of the first table, that would be:

hxs.select("""
    //div[div[contains(span, "INCOME STATEMENT")]]
        //table[contains(@class,"mnytbl")]/tbody/tr
           /td[
               position() = (
                       count(ancestor::table[1]
                                 /thead/tr/th[contains(., "SALES")]
                                          /preceding-sibling::th)
                       + 1
                   )
               ]/span/text()
""").extract()

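If it helps to see the "match the td by the position of its th" idea outside of XPath, here is a minimal sketch in plain Python. It uses the standard library's xml.etree.ElementTree instead of Scrapy, and the table markup, the PERIOD and EBIT columns, and the column_by_header helper are all made up for illustration; only the two SALES values are taken from the question. The real page is certainly more complex, so treat this as an illustration of the technique, not as a drop-in replacement for the selector above.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup: the real page's structure may differ.
# Note the "headers" attributes are deliberately arbitrary, since the
# lookup below does not rely on them.
SAMPLE = """
<table class="mnytbl">
  <thead>
    <tr><th>PERIOD</th><th>SALES</th><th>EBIT</th></tr>
  </thead>
  <tbody>
    <tr>
      <td headers="x1"><span>2012</span></td>
      <td headers="x2"><span>
1.26 Bil
        </span></td>
      <td headers="x3"><span>n/a</span></td>
    </tr>
    <tr>
      <td headers="y1"><span>2011</span></td>
      <td headers="y2"><span>
893.90 Mil
        </span></td>
      <td headers="y3"><span>n/a</span></td>
    </tr>
  </tbody>
</table>
"""


def column_by_header(table, label):
    """Return the stripped span text of every cell under the th containing label."""
    headers = table.findall("./thead/tr/th")
    # Same role as count(preceding-sibling::th) in the XPath: the number of
    # th cells before the matching one (0-based here, hence no "+ 1").
    # Raises StopIteration if no header contains the label.
    index = next(
        i for i, th in enumerate(headers) if label in "".join(th.itertext())
    )
    values = []
    for row in table.findall("./tbody/tr"):
        cell = row.findall("./td")[index]
        values.append(cell.find("./span").text.strip())
    return values


table = ET.fromstring(SAMPLE.strip())
sales = column_by_header(table, "SALES")
print(sales)  # ['1.26 Bil', '893.90 Mil']
```

The point of the design is that nothing depends on the generated "headers" values, so the lookup keeps working even if those ids differ between the live response and a saved copy.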
Good to hear, @user1723988 ! you may accept the answer if you are happy with it, thanks. — paul trmbrth, Sep 23 '13 at 22:14
