I am trying to write a web scraper using scrapy and xpath but I am experiencing a frustrating problem.

I need the text of a paragraph with the following HTML:

    <p class="list-details__item__date" id="match-date">04.03.2017 - 15:00</p>

I might be wrong, but since the `p` has an `id` attribute, I should be able to select it simply with:

    response.xpath('//p[@id="match-date"]/text()').extract()

However, this does not work.

I know a little XPath and have written scrapers in the past, but this one is giving me trouble. I tried several variants, but none of them works:

    response.xpath('//p[contains(@class, "list-details__item__date") and contains(@id,"match-date")]/text()').extract()

    response.xpath('//p[@class="list-details__item__date" and @id="match-date"]/text()').extract()

I also tried using "contains", as suggested in many answers, but it did not work either. This might be a stupid mistake on my part... it would be great if someone could help me!

Thank you so much

peppuce
  • Your example input shows `"match-date"` with a dash, and your XPath uses an `_` (`"match_date"`). Try `response.xpath('//p[@id="match-date"]/text()').extract()` – paul trmbrth Mar 03 '17 at 16:50
  • thanks, that was a typo due to copy/paste mess...I fixed it now – peppuce Mar 03 '17 at 17:03
  • btw I am able to extract other elements from the page...this one seems to give problems because of the multiple attributes – peppuce Mar 03 '17 at 17:10

1 Answer

Maybe the text of `match-date` is loaded via AJAX/JavaScript. Disable JavaScript in your browser and check whether the text is still there.

Also, for the sake of simplicity, you could use CSS selectors instead of XPath:

    response.css('#match-date::text').extract()

EDIT:

To get the value of the `data-dt` attribute, do this:

    response.css('#match-date::attr(data-dt)').extract()

Or with XPath:

    response.xpath('//p[@id="match-date"]/@data-dt').extract()
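
The extracted attribute is a plain string such as `4,3,2017,15,00`, so it still has to be turned into a date. Here is a minimal sketch of one way to do that, using only the standard library. The helper name `parse_data_dt` is hypothetical, and the field order (day, month, year, hour, minute) is an assumption inferred from the visible text `04.03.2017 - 15:00`; check it against a few other matches before relying on it.

```python
from datetime import datetime


def parse_data_dt(raw):
    """Convert a data-dt string such as "4,3,2017,15,00" into a datetime.

    Assumption: the fields are ordered day, month, year, hour, minute.
    """
    day, month, year, hour, minute = (int(part) for part in raw.split(","))
    return datetime(year, month, day, hour, minute)


# In a spider callback the string would come from something like
# response.xpath('//p[@id="match-date"]/@data-dt').extract_first()
match_date = parse_data_dt("4,3,2017,15,00")
print(match_date.strftime("%d.%m.%Y - %H:%M"))
```

With the sample value from the page, this prints the date in the same format the browser shows once JavaScript has run.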
Umair Ayub
  • Hi @Umair and thanks for your answer...you are right, I disabled JavaScript and the code changed...the id is still there, but now there is no text in the paragraph, but it has an attribute `data-dt="4,3,2017,15,00"`...I will try to access it from my code and let you know – peppuce Mar 03 '17 at 17:25
  • Thanks a lot @umair, I fixed my xpath to `response.xpath('//p[@id="match-date"]/@data-dt').extract()` and it is working (too late now to learn about css selectors :))...thanks again!!! – peppuce Mar 03 '17 at 17:30
  • just wondering if there was any other way to have scrapy read the same code as I do with javascript enabled...it would be a lot easier – peppuce Mar 03 '17 at 17:34
  • @peppuce `to have scrapy read the same code as I do with javascript enabled` NOT POSSIBLE with scrapy only ... you will have to use Selenium+PhantomJS along with Scrapy. See my answer here: http://stackoverflow.com/a/40833619/4094231 – Umair Ayub Mar 03 '17 at 18:15
  • @peppuce also see edits in my answer ... please accept my answer – Umair Ayub Mar 03 '17 at 18:17