
I got this HTML (simplified):

<td class="pad10">
  <div class="button-left" style="margin-bottom: 4px">04.09.2013</div>
  <table width="100%" class="record generic schedule margin-4"></table>
  <table width="100%" class="record generic schedule margin-4"></table>
  <div class="button-left" style="margin-bottom: 4px">05.10.2013</div>
  <table width="100%" class="record generic schedule margin-4"></table>
  <table width="100%" class="record generic schedule margin-4"></table>
  <table width="100%" class="record generic schedule margin-4"></table>
  <table width="100%" class="record generic schedule margin-4"></table>
</td>

I want to get a dict structure containing the following ("row" means table content, grouped by the dates in the main table):

{'04.09.2013': [row 1, row 2],

 '05.10.2013': [row 1, row 2, row 3, row 4]}

I can extract all 'div' with:

dt = s.xpath('//div[contains(@class, "button-left")]')

I can extract all 'table' with:

tables = s.xpath('//table[contains(@class, "record generic schedule margin-4")]')

But I don't know how to link each 'dt' with its corresponding 'tables' in the Scrapy parser. Is it possible to create a condition during the scraping process, like: if you find a 'div', extract all following 'table's until you find another 'div'?

With Chrome I get two XPath examples of these elements:

//*[@id="wrap"]/table/tbody/tr/td/table[3]/tbody/tr/td/div[2]
//*[@id="wrap"]/table/tbody/tr/td/table[3]/tbody/tr/td/table[1]

Maybe it will help to imagine the full structure of the table.

Solution (thanks to @marven):

    s = Selector(response)

    table = {}
    current_key = None
    for e in s.xpath('//td[@class="pad10"]/*'):
        if bool(int(e.xpath('@class="button-left"').extract()[0])):
            current_key = e.xpath('text()').extract()[0]
        elif bool(int(e.xpath('@class="record generic schedule margin-4"').extract()[0])):
            t = e.extract()
            if current_key in table:
                table[current_key].append(t)
            else:
                table[current_key] = [t]
  • Let me add my goal. I want to parse the whole schedule and save it to a database. Link: http://www.eurobasket2013.org/en/cid_8Xfg3jZMG1QuJnp6pnUWd3.pageID_hNYxJM-WHQcNWJ9a6IY-I2.compID_qMRZdYCZI6EoANOrUf9le2.season_2013.roundID_8722.html – kepurlaukis Aug 01 '14 at 23:30

3 Answers


With that particular format you could do this:

Get the parent element: t = s.xpath('//div[contains(@class, "button-left")]/..')

Get the first div: t.xpath('./div[1]') -- you might have to use position()=1

Get the first two tables: t.xpath('./table[position() < 3]')

Get the second div: t.xpath('./div[2]')

Get the rest of the tables: t.xpath('./table[position() > 2]')
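The position()-based selection above can be mimicked with plain list slicing. Here is a minimal, runnable sketch using the stdlib's ElementTree instead of a Scrapy selector, on a cut-down, well-formed version of the question's HTML (class values shortened):

```python
import xml.etree.ElementTree as ET

# Cut-down, well-formed version of the question's HTML
html = """<td class="pad10">
  <div class="button-left">04.09.2013</div>
  <table class="record">1</table>
  <table class="record">2</table>
  <div class="button-left">05.10.2013</div>
  <table class="record">3</table>
  <table class="record">4</table>
  <table class="record">5</table>
  <table class="record">6</table>
</td>"""

td = ET.fromstring(html)
tables = td.findall('table')   # all direct <table> children, in document order
first_two = tables[:2]         # like table[position() < 3]
rest = tables[2:]              # like table[position() > 2]
```

With this sample, `first_two` holds the two tables under 04.09.2013 and `rest` the four under 05.10.2013; the slice boundaries are hard-coded, which is exactly the brittleness noted below.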

This is very brittle: if the HTML changes, this code won't work. It was hard to answer with only the simplified HTML you supplied, and without knowing whether this structure is static or will change in the future. I would've asked these things in a comment, but I don't have enough rep :P

sources:

How to read attribute of a parent node from a child node in XSLT

What is the xpath to select a range of nodes?

https://stackoverflow.com/a/2407881/2368836

rocktheartsm4l
  • It won't change. I will use it only once: just parse and save the parsed info to a database. The easiest way for me would be to just copy this HTML part and parse it with a simple iteration over the nodes, like an XML file. But I want to use Scrapy and also to find an elegant solution for this. I have in mind something like grouping all the siblings... but for now I need more information and practice using XPath... – kepurlaukis Aug 01 '14 at 23:22
  • I added a link to my question. You can see the table structure there. – kepurlaukis Aug 02 '14 at 08:31

See if this approach is applicable for your case : XPATH get all nodes between text_1 and text_2

Using the same approach as in the linked question above, we can basically filter the <table> elements down to those that have a specific <div> as both a preceding-sibling and a following-sibling. For example (using the XPath criteria you've posted for getting the <table>s and the <div>s):

//table
    [contains(@class, "record generic schedule margin-4")]
    [
        preceding-sibling::div[contains(@class, "button-left")] 
            and 
        following-sibling::div[contains(@class, "button-left")]
    ]
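The predicate above needs a full XPath 1.0 engine (e.g. lxml); the same "has a div both before and after" filter can be sketched with the stdlib's ElementTree on a well-formed sample mirroring the question's structure (class values shortened):

```python
import xml.etree.ElementTree as ET

# Well-formed sample mirroring the question's structure
html = """<td class="pad10">
  <div class="button-left">04.09.2013</div>
  <table class="record">1</table>
  <table class="record">2</table>
  <div class="button-left">05.10.2013</div>
  <table class="record">3</table>
  <table class="record">4</table>
</td>"""

children = list(ET.fromstring(html))
div_positions = [i for i, e in enumerate(children) if e.tag == 'div']

# Keep only tables that have a div both before and after them,
# i.e. the preceding-sibling / following-sibling filter
between = [e.text for i, e in enumerate(children)
           if e.tag == 'table'
           and div_positions[0] < i < div_positions[-1]]
```

Note that the tables after the last div ('3' and '4' here) are dropped by this filter, which is exactly the limitation the comment below points out.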
har07
  • Good example, but it didn't work. There are unequal numbers of <table>s between pairs of <div>s, and after the last <div> there are a few more <table> nodes but no <div> node at the end, so collecting between siblings will not help. I think what I need to do is, in a first iteration, get all <div>s and calculate something like coordinates for each, and in a second iteration get all nodes between two coordinates. Maybe a coordinate could be a specific XPath... – kepurlaukis Aug 02 '14 at 09:48

What you can do is select all of the nodes and loop through them while checking whether the current node is a div or a table.

Using this as my test case,

<div class="asdf">
  <div class="button-left" style="margin-bottom: 4px">04.09.2013</div>
  <table width="100%" class="record generic schedule margin-4">1</table>
  <table width="100%" class="record generic schedule margin-4">2</table>
  <div class="button-left" style="margin-bottom: 4px">05.10.2013</div>
  <table width="100%" class="record generic schedule margin-4">3</table>
  <table width="100%" class="record generic schedule margin-4">4</table>
  <table width="100%" class="record generic schedule margin-4">5</table>
  <table width="100%" class="record generic schedule margin-4">6</table>
</div>

I use the following to loop through the nodes, updating which div the current node is "under":

currdiv = None
mydict = {}
for e in sel.xpath('//div[@class="asdf"]/*'):
    if bool(int(e.xpath('@class="button-left"').extract()[0])):
        currdiv = e.xpath('text()').extract()[0]
        mydict[currdiv] = []
    elif currdiv is not None:
        mydict[currdiv] += e.xpath('text()').extract()

This results in:

{u'04.09.2013': [u'1', u'2'], u'05.10.2013': [u'3', u'4', u'5', u'6']}
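For reference, here is a self-contained stdlib reproduction of the same loop; ElementTree stands in for the Scrapy Selector, and the Scrapy-specific `@class="..."` XPath tests become plain attribute checks, so this is a sketch rather than the exact Scrapy code above:

```python
import xml.etree.ElementTree as ET

html = """<div class="asdf">
  <div class="button-left">04.09.2013</div>
  <table class="record generic schedule margin-4">1</table>
  <table class="record generic schedule margin-4">2</table>
  <div class="button-left">05.10.2013</div>
  <table class="record generic schedule margin-4">3</table>
  <table class="record generic schedule margin-4">4</table>
  <table class="record generic schedule margin-4">5</table>
  <table class="record generic schedule margin-4">6</table>
</div>"""

currdiv = None
mydict = {}
for e in ET.fromstring(html):
    if e.tag == 'div' and 'button-left' in e.get('class', ''):
        currdiv = e.text                 # a date div starts a new group
        mydict[currdiv] = []
    elif currdiv is not None:
        mydict[currdiv].append(e.text)   # table content joins the current group
```

Running this produces the same grouping as the Scrapy version: two rows under the first date and four under the second.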
marven