I seem to have hit a brick wall with this one again, and I'm hoping somebody can answer it off the top of their head.

Here's the example code:

def parse_page(self,response):
    hxs = HtmlXPathSelector(response)

    item = response.meta['item']
    item["Details_H1"] = hxs.select('//*[@id="ctl09_p_ctl17_ctl04_ctl01_ctl00_dlProps"]/tr[1]/td[1]/text()').extract()
    return item

It seems that the @id used for Details_H1 can change. E.g. for one page it could be @id="ctl08_p_ctl17_ctl04_ctl01_ctl00_dlProps" and for the next page it is, seemingly at random, @id="ctl09_p_ctl17_ctl04_ctl01_ctl00_dlProps".

I would like to implement the equivalent of a do-until loop, so that the code steps through the numbers in increments of 1 until the XPath returns a non-empty result. For example, I could set i=108 and increment it (i=i+1) each time until hxs.select('//*[@id="ctl09_p_ctl17_ctl04_ctl01_ctl00_dlProps"]/tr[1]/td[1]/text()').extract() <> []

How would I be able to implement this?

Your help and contributions are greatly appreciated.

EDIT 1

Fixed by TNT's answer below. The code should read:

def parse_page(self,response):
    hxs = HtmlXPathSelector(response)

    item = response.meta['item']
    item["Details_H1"] = hxs.select('//*[contains(@id, "_p_ctl17_ctl04_ctl01_ctl00_dlProps")]/tr[1]/td[1]/text()').extract()
    return item
slixor
  • Use a global variable or an argument that serves as a counter, and format your string to fit that. – aIKid Nov 17 '13 at 15:53
  • I'm not too familiar with python syntax. Could you please provide me with an example or link me to an article where it's covered – slixor Nov 17 '13 at 23:46

2 Answers

1

The 'natural' XPath way would be to generalize your XPath expression:

xp = '//*[contains(@id, "_p_ctl17_ctl04_ctl01_ctl00_dlProps")]/tr[1]/td[1]/text()'
item["Details_H1"] = hxs.select(xp).extract()

But I'm groping in the dark here. Your XPath expression would probably be better off beginning with something like //table or //tbody.

In any case a "do until" would be ugly.
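To illustrate why the contains() version matches both pages, here is a minimal sketch that emulates the predicate with the standard library only. It is not Scrapy code: ElementTree has no contains() function, so the substring test is done in Python instead, and the two page fragments and the helper name first_cell_text are made up for the example.

```python
import xml.etree.ElementTree as ET

SUFFIX = "_p_ctl17_ctl04_ctl01_ctl00_dlProps"


def first_cell_text(markup, suffix=SUFFIX):
    """Return the text of tr[1]/td[1] for every element whose id contains suffix."""
    root = ET.fromstring(markup)
    results = []
    for element in root.iter():
        # Python equivalent of the XPath predicate contains(@id, suffix)
        if suffix in element.get("id", ""):
            cell = element.find("tr[1]/td[1]")
            if cell is not None and cell.text:
                results.append(cell.text)
    return results


# Hypothetical page fragments that differ only in the numeric prefix of the id
PAGE_A = ('<div><table id="ctl08_p_ctl17_ctl04_ctl01_ctl00_dlProps">'
          '<tr><td>Alpha</td><td>x</td></tr></table></div>')
PAGE_B = ('<div><table id="ctl09_p_ctl17_ctl04_ctl01_ctl00_dlProps">'
          '<tr><td>Beta</td><td>y</td></tr></table></div>')

print(first_cell_text(PAGE_A))  # ['Alpha']
print(first_cell_text(PAGE_B))  # ['Beta']
```

Because the test only looks at the stable part of the id, no counter or loop is needed.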

TNT
  • I'll try your approach too and will report back! Reason why I removed the `//tbody` is because I wasn't able to pull any results as per this post here: [link](http://stackoverflow.com/questions/7941060/parsing-html-with-xpath-python-scrapy). Cheers – slixor Nov 19 '13 at 23:04
  • Just tried it and it works a treat! I have edited my code above (in my initial post) to reflect the fix. Just as an FYI, is there any way to have an `OR` function for the `contains`. I.E. Let's say if it contains `"ctl17"` or `"ctl01_ctl00_dlProps"`. Is it as easy as `[contains(@id, "ctl17", "ctl01_ctl00_dlProps")]`? Cheers – slixor Nov 20 '13 at 15:18
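On the OR question in the last comment: in XPath 1.0, contains() takes exactly two arguments, so the three-argument form would not work. The usual way is to join two contains() calls with the or operator inside the predicate. A small sketch that builds such an expression follows; the helper name contains_any is made up for illustration.

```python
def contains_any(attribute, substrings):
    """Build an XPath 1.0 predicate that is true if the attribute contains any substring.

    Note: substrings must not contain double quotes, since they are
    embedded in double-quoted XPath string literals.
    """
    if not substrings:
        raise ValueError("need at least one substring")
    tests = ['contains(@%s, "%s")' % (attribute, s) for s in substrings]
    return "[" + " or ".join(tests) + "]"


predicate = contains_any("id", ["ctl17", "ctl01_ctl00_dlProps"])
xp = "//*" + predicate + "/tr[1]/td[1]/text()"
print(xp)
# //*[contains(@id, "ctl17") or contains(@id, "ctl01_ctl00_dlProps")]/tr[1]/td[1]/text()
```

The resulting string can be passed to hxs.select(xp) in the same way as the single contains() expression.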
0

You can try this

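# Assumes hxs = HtmlXPathSelector(response), as in parse_page in the question.
item = response.meta['item']
i = 8  # the id prefix is "ctl08" (letter l, then 08), not "ct108"
while i < 100:  # assumes a two-digit number; the bound avoids an endless loop
    xpath = '//*[@id="ctl%02d_p_ctl17_ctl04_ctl01_ctl00_dlProps"]/tr[1]/td[1]/text()' % i
    item["Details_H1"] = hxs.select(xpath).extract()
    if item["Details_H1"]:  # non-empty result: this id exists on the page
        break
    i += 1
yield item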
Omair Shamshir
  • Hi Omair, I tried the code you wrote but it's not working on my end. I changed it to i=8 and xpath='//*[@id="ctl0%d_p_c.....' %i since it's l08, not 108. I even tried changing the indentation of yield item. Do you have any other suggestions? I appreciate your help – slixor Nov 18 '13 at 15:12
  • EDIT: Should there be an i = i + 1 somewhere? – slixor Nov 18 '13 at 15:21
  • Can you specify what problem it is causing? – Omair Shamshir Nov 19 '13 at 06:26
  • I'll try the new syntax tonight. It wasn't producing any output. Are you able to edit your code to include the `def parse_page(self,response): hxs = HtmlXPathSelector(response)`? I just need to know how the indentation should look. I also assume `return item` is replaced with `yield item` – slixor Nov 19 '13 at 06:41
  • Still couldn't get it to work. Do you have a full code you'd be able to supply as a demonstration? That way I could also get better context. Thanks – slixor Nov 20 '13 at 15:22
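For the fuller demonstration requested in the comments, the do-until idea can be sketched as a bounded loop that stops at the first non-empty result. This is only an illustration under assumptions: first_match and fake_extract are made-up names, the two-digit numbering (ctl08, ctl09, ...) is inferred from the ids in the question, and fake_extract stands in for a real page so the example runs without Scrapy.

```python
ID_TEMPLATE = '//*[@id="ctl%02d_p_ctl17_ctl04_ctl01_ctl00_dlProps"]/tr[1]/td[1]/text()'


def first_match(extract, start=8, stop=100):
    """Try ids ctl08, ctl09, ... until extract() returns a non-empty list.

    extract is any callable that takes an XPath string and returns a list,
    e.g. lambda xp: hxs.select(xp).extract() inside parse_page.
    Returns (number, result), or (None, []) if nothing matched.
    """
    for i in range(start, stop):
        result = extract(ID_TEMPLATE % i)
        if result:  # non-empty list: stop searching
            return i, result
    return None, []


def fake_extract(xpath):
    """Stand-in for a page on which only the ctl09 id exists."""
    return ["Some heading"] if '"ctl09_' in xpath else []


print(first_match(fake_extract))  # (9, ['Some heading'])
```

Inside parse_page this would be used roughly as `_, item["Details_H1"] = first_match(lambda xp: hxs.select(xp).extract())` followed by `return item`, although the contains() expression from the accepted fix avoids the loop altogether.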