
I got this HTML (simplified):

<td class="pad10">
  <div class="button-left" style="margin-bottom: 4px">04.09.2013</div>
  <table width="100%" class="record generic schedule margin-4"></table>
  <table width="100%" class="record generic schedule margin-4"></table>
  <div class="button-left" style="margin-bottom: 4px">05.10.2013</div>
  <table width="100%" class="record generic schedule margin-4"></table>
  <table width="100%" class="record generic schedule margin-4"></table>
  <table width="100%" class="record generic schedule margin-4"></table>
  <table width="100%" class="record generic schedule margin-4"></table>
</td>

I want to get a dict structure containing the following ("row" means table content, grouped by the dates in the main table):

{'04.09.2013': [row 1, row 2],

 '05.10.2013': [row 1, row 2, row 3, row 4]}

I can extract all 'div' with:

dt = s.xpath('//div[contains(@class, "button-left")]')

I can extract all 'table' with:

tables = s.xpath('//table[contains(@class, "record generic schedule margin-4")]')

But I don't know how to link each 'dt' with its corresponding 'tables' in the Scrapy parser. Is it possible to create a condition during the scraping process, like: if you find a 'div', extract all following 'table's until you find another 'div'?

With Chrome I get two XPath examples of these elements:

//*[@id="wrap"]/table/tbody/tr/td/table[3]/tbody/tr/td/div[2]
//*[@id="wrap"]/table/tbody/tr/td/table[3]/tbody/tr/td/table[1]

Maybe it will help to imagine the full structure of the table.

Solution (thanks to @marven):

    s = Selector(response)

    table = {}
    current_key = None
    for e in s.xpath('//td[@class="pad10"]/*'):
        if bool(int(e.xpath('@class="button-left"').extract()[0])):
            current_key = e.xpath('text()').extract()[0]
        elif bool(int(e.xpath('@class="record generic schedule margin-4"').extract()[0])):
            t = e.extract()
            if current_key in table:
                table[current_key].append(t)
            else:
                table[current_key] = [t]
  • Let me add my goal. I want to parse the whole schedule and save it to a database. Link: http://www.eurobasket2013.org/en/cid_8Xfg3jZMG1QuJnp6pnUWd3.pageID_hNYxJM-WHQcNWJ9a6IY-I2.compID_qMRZdYCZI6EoANOrUf9le2.season_2013.roundID_8722.html – kepurlaukis Aug 01 '14 at 23:30

3 Answers


With that particular format you could do this:

Get the parent element: t = s.xpath('//div[contains(@class, "button-left")]/..')

Get the first div: t.xpath('./div[1]') -- you might have to use position()=1

Get the first two tables: t.xpath('./table[position() < 3]')

Get the second div: t.xpath('./div[2]')

Get the rest of the tables: t.xpath('./table[position() > 2]')
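The position()-based selection above can be mimicked with plain list slicing. Here is a minimal, runnable sketch using the stdlib's ElementTree instead of a Scrapy selector, on a cut-down, well-formed version of the question's HTML (class values shortened):

```python
import xml.etree.ElementTree as ET

# Cut-down, well-formed version of the question's HTML
html = """<td class="pad10">
  <div class="button-left">04.09.2013</div>
  <table class="record">1</table>
  <table class="record">2</table>
  <div class="button-left">05.10.2013</div>
  <table class="record">3</table>
  <table class="record">4</table>
  <table class="record">5</table>
  <table class="record">6</table>
</td>"""

td = ET.fromstring(html)
tables = td.findall('table')   # all direct <table> children, in document order
first_two = tables[:2]         # like table[position() < 3]
rest = tables[2:]              # like table[position() > 2]
```

With this sample, `first_two` holds the two tables under 04.09.2013 and `rest` the four under 05.10.2013; the slice boundaries are hard-coded, which is exactly the brittleness noted below.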

This is very brittle: if the HTML changes, this code won't work. It was hard to answer with only the simplified HTML you supplied, and without knowing whether this structure is static or will change in the future. I would've asked these things in a comment, but I don't have enough rep :P

sources:

How to read attribute of a parent node from a child node in XSLT

What is the xpath to select a range of nodes?

https://stackoverflow.com/a/2407881/2368836

rocktheartsm4l
  • It won't change. I will use it only once: just parse and save the parsed info to a database. The easiest way for me would be to just copy this HTML part and parse it with a simple iteration over the nodes, like an XML file. But I want to use Scrapy and also to find an elegant solution for this. I have in mind something like grouping all the siblings... but for now I need more information and practice using XPath... – kepurlaukis Aug 01 '14 at 23:22
  • I added a link to my question. You can see the table structure there. – kepurlaukis Aug 02 '14 at 08:31

See if this approach is applicable for your case : XPATH get all nodes between text_1 and text_2

Using the same approach as in the linked question above, we can basically filter the <table> elements down to those that have a specific <div> as both a preceding-sibling and a following-sibling. For example (using the XPath criteria you've posted for getting the <table>s and the <div>s):

//table
    [contains(@class, "record generic schedule margin-4")]
    [
        preceding-sibling::div[contains(@class, "button-left")] 
            and 
        following-sibling::div[contains(@class, "button-left")]
    ]
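The predicate above needs a full XPath 1.0 engine (e.g. lxml); the same "has a div both before and after" filter can be sketched with the stdlib's ElementTree on a well-formed sample mirroring the question's structure (class values shortened):

```python
import xml.etree.ElementTree as ET

# Well-formed sample mirroring the question's structure
html = """<td class="pad10">
  <div class="button-left">04.09.2013</div>
  <table class="record">1</table>
  <table class="record">2</table>
  <div class="button-left">05.10.2013</div>
  <table class="record">3</table>
  <table class="record">4</table>
</td>"""

children = list(ET.fromstring(html))
div_positions = [i for i, e in enumerate(children) if e.tag == 'div']

# Keep only tables that have a div both before and after them,
# i.e. the preceding-sibling / following-sibling filter
between = [e.text for i, e in enumerate(children)
           if e.tag == 'table'
           and div_positions[0] < i < div_positions[-1]]
```

Note that the tables after the last div ('3' and '4' here) are dropped by this filter, which is exactly the limitation the comment below points out.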
har07
  • Good example, but it didn't work. There are unequal numbers of <table>s between pairs of <div>s, and after the last <div> there are a few more <table> nodes but no <div> node at the end, so collecting between siblings will not help. I think what I need to do is, in a first iteration, get all <div>s and calculate something like coordinates for each, and in a second iteration get all nodes between two coordinates. Maybe a coordinate could be a specific XPath... – kepurlaukis Aug 02 '14 at 09:48

What you can do is select all of the nodes and loop through them while checking whether the current node is a div or a table.

Using this as my test case,

<div class="asdf">
  <div class="button-left" style="margin-bottom: 4px">04.09.2013</div>
  <table width="100%" class="record generic schedule margin-4">1</table>
  <table width="100%" class="record generic schedule margin-4">2</table>
  <div class="button-left" style="margin-bottom: 4px">05.10.2013</div>
  <table width="100%" class="record generic schedule margin-4">3</table>
  <table width="100%" class="record generic schedule margin-4">4</table>
  <table width="100%" class="record generic schedule margin-4">5</table>
  <table width="100%" class="record generic schedule margin-4">6</table>
</div>

I use the following to loop through the nodes, updating which div the current node is "under":

currdiv = None
mydict = {}
for e in sel.xpath('//div[@class="asdf"]/*'):
    if bool(int(e.xpath('@class="button-left"').extract()[0])):
        currdiv = e.xpath('text()').extract()[0]
        mydict[currdiv] = []
    elif currdiv is not None:
        mydict[currdiv] += e.xpath('text()').extract()

This results in:

{u'04.09.2013': [u'1', u'2'], u'05.10.2013': [u'3', u'4', u'5', u'6']}
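For reference, here is a self-contained stdlib reproduction of the same loop; ElementTree stands in for the Scrapy Selector, and the Scrapy-specific `@class="..."` XPath tests become plain attribute checks, so this is a sketch rather than the exact Scrapy code above:

```python
import xml.etree.ElementTree as ET

html = """<div class="asdf">
  <div class="button-left">04.09.2013</div>
  <table class="record generic schedule margin-4">1</table>
  <table class="record generic schedule margin-4">2</table>
  <div class="button-left">05.10.2013</div>
  <table class="record generic schedule margin-4">3</table>
  <table class="record generic schedule margin-4">4</table>
  <table class="record generic schedule margin-4">5</table>
  <table class="record generic schedule margin-4">6</table>
</div>"""

currdiv = None
mydict = {}
for e in ET.fromstring(html):
    if e.tag == 'div' and 'button-left' in e.get('class', ''):
        currdiv = e.text                 # a date div starts a new group
        mydict[currdiv] = []
    elif currdiv is not None:
        mydict[currdiv].append(e.text)   # table content joins the current group
```

Running this produces the same grouping as the Scrapy version: two rows under the first date and four under the second.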
marven