
I am trying to scrape a betting site. However, when I check for the retrieved data in scrapy shell, I receive nothing.

The XPath to the element I need is //*[@id="yui_3_5_0_1_1562259076537_31330"], and when I run it in the shell, this is what I get:


    In [18]: response.xpath('//*[@id="yui_3_5_0_1_1562259076537_31330"]')
    Out[18]: []

The output is [], but I expected it to be something from which I could extract the href.

When I use the "inspect" tool in Chrome while the site is still loading, this id is outlined in purple. Does this mean that the site is using JavaScript? And if so, is this the reason why Scrapy does not find the item and returns []?

Ale0311
  • What's the page you are crawling? – gunesevitan Jul 04 '19 at 18:21
  • What is the site you are trying to scrape? – GmrYael Jul 04 '19 at 18:40
  • https://www.betfair.ro/sport/home#sscpl=ro – Ale0311 Jul 04 '19 at 18:54
  • The site uses JavaScript to generate random ids for the elements. You can try using the class attribute or a better XPath query. What item are you trying to scrape? – GmrYael Jul 04 '19 at 19:52
  • Do print(response.text) to see what you're really getting. Then investigate what's going on with the JS and either Splash it or Selenium it if necessary. My order of operations goes Scrapy > Splash > Selenium – ThePyGuy Jul 05 '19 at 04:03
  • Also be sure to set USER_AGENT in your settings as that will be passed on to scrapy shell instances. – ThePyGuy Jul 05 '19 at 04:36
  • @gmrYael initially I wanted to scrape the titles of all the live matches. Then I tried to scrape the titles of the football matches, but I got the same problem. I’ll try selecting by class attributes and I’ll get back to you guys. Thanks! – Ale0311 Jul 05 '19 at 08:19
  • @ThePyGuy I tried printing the response, but I got nothing. I’ll give splash a try also and see what I get. Thanks! About USER_AGENT, why is that necessary and where to set it? – Ale0311 Jul 05 '19 at 08:21
  • See https://stackoverflow.com/q/8550114/939364 and https://docs.scrapy.org/en/master/topics/dynamic-content.html – Gallaecio Jul 05 '19 at 13:25
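
The comments above suggest that the id is generated by JavaScript when the page loads, which would explain why it never appears in the HTML that Scrapy downloads. A minimal sketch of how one might check this, using only the standard library: it assumes that YUI-style ids end in a 13-digit millisecond timestamp followed by a counter (an assumption based on the shape of the id in the question), and `raw_html` is a hypothetical stand-in for `response.text`.

```python
import re
from datetime import datetime, timezone

# Assumed layout of a YUI-generated id:
# yui_<version parts>_<instance>_<millisecond timestamp>_<counter>
YUI_ID = re.compile(r"^yui_(?:\d+_)+?(\d{13})_(\d+)$")


def generated_at(element_id):
    """Return the UTC time embedded in a YUI-style id, or None if it does not match."""
    match = YUI_ID.match(element_id)
    if match is None:
        return None
    # Integer division drops the milliseconds and avoids float rounding.
    return datetime.fromtimestamp(int(match.group(1)) // 1000, tz=timezone.utc)


def id_in_raw_html(html, element_id):
    """Return True if the id appears in the HTML that the server actually sent."""
    return f'id="{element_id}"' in html


element_id = "yui_3_5_0_1_1562259076537_31330"

# Hypothetical raw response body; in scrapy shell this would be response.text.
raw_html = '<div class="events-title"><a title="Football">Football</a></div>'

print(generated_at(element_id))              # 2019-07-04 16:51:16+00:00
print(id_in_raw_html(raw_html, element_id))  # False
```

If the embedded time matches the moment the page was opened in the browser, the id changes on every visit and cannot be used as a stable selector; a class-based query is the safer choice.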

1 Answer


I tried scraping the site using only Scrapy, and this is my result.

This is the items.py file:

    import scrapy

    class LifeMatchsItem(scrapy.Item):

        Event = scrapy.Field()  # Name of the event
        Match = scrapy.Field()  # Team1 vs Team2
        Date = scrapy.Field()   # Date of the match

This is my spider code:


    import scrapy
    from LifeMatchesProject.items import LifeMatchsItem


    class LifeMatchesSpider(scrapy.Spider):
        name = 'life_matches'
        start_urls = ['http://www.betfair.com/sport/home#sscpl=ro/']

        custom_settings = {'FEED_EXPORT_ENCODING': 'utf-8'}

        def parse(self, response):
            # Each "events-title" div is the heading of one event.
            for event in response.xpath('//div[contains(@class,"events-title")]'):
                # The matches for that event are in the first <ul> after the heading.
                for element in event.xpath('./following-sibling::ul[1]/li'):
                    item = LifeMatchsItem()
                    item['Event'] = event.xpath('./a/@title').get()
                    item['Match'] = element.xpath('.//div[contains(@class,"event-name-info")]/a/@data-event').get()
                    # normalize-space() trims and collapses whitespace in the date text.
                    item['Date'] = element.xpath('normalize-space(.//div[contains(@class,"event-name-info")]/a//span[@class="date"]/text())').get()
                    yield item

And this is the result

file.json
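
To see how the nested loops in the spider pair each event heading with the list that follows it, here is a rough standalone sketch. It swaps in the standard library's `xml.etree.ElementTree` for Scrapy selectors so it can run without a network request, and the markup in `SAMPLE` is hypothetical and much simpler than the real page. It only approximates `following-sibling::ul[1]` by walking the siblings in order.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; the real page structure is an assumption.
SAMPLE = """
<section>
  <div class="events-title"><a title="Football">Football</a></div>
  <ul>
    <li><div class="event-name-info">
      <a data-event="Team A v Team B"><span class="date">Today 20:00</span></a>
    </div></li>
  </ul>
  <div class="events-title"><a title="Tennis">Tennis</a></div>
  <ul>
    <li><div class="event-name-info">
      <a data-event="Player C v Player D"><span class="date"> Tomorrow   10:30 </span></a>
    </div></li>
  </ul>
</section>
"""


def parse_events(markup):
    """Pair each events-title heading with the items of the next <ul> sibling."""
    rows = []
    current_event = None
    for child in ET.fromstring(markup):
        classes = child.get("class", "").split()
        if child.tag == "div" and "events-title" in classes:
            current_event = child.find("./a").get("title")
        elif child.tag == "ul" and current_event is not None:
            for li in child.findall("./li"):
                link = li.find(".//div[@class='event-name-info']/a")
                date = link.find(".//span[@class='date']")
                rows.append({
                    "Event": current_event,
                    "Match": link.get("data-event"),
                    # Same effect as XPath normalize-space(): trim and collapse whitespace.
                    "Date": " ".join(date.text.split()),
                })
            # Only the first <ul> after a heading belongs to it.
            current_event = None
    return rows


for row in parse_events(SAMPLE):
    print(row)
```

The keys `Event`, `Match` and `Date` mirror the item fields above; everything else in the sample is made up for illustration.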

GmrYael
  • Thanks a lot! This was very helpful. However, I have one more, silly, question. How can you print the scraped data in such a format? I only managed to print it in .csv or .json format. – Ale0311 Jul 08 '19 at 09:51
  • Scrapy's built-in feed formats include JSON, CSV and XML: https://docs.scrapy.org/en/latest/topics/feed-exports.html#topics-feed-format – GmrYael Jul 08 '19 at 16:37
  • And in what format did you print it in the photo you posted? – Ale0311 Jul 09 '19 at 17:56
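
For the output formats discussed in these comments, one possible way to choose them is in the project settings. This is a sketch that assumes Scrapy 2.1 or later, where the `FEEDS` setting is available; the file names are hypothetical, and it does not show how the result in the answer was actually produced.

```python
# settings.py (sketch; assumes Scrapy 2.1 or later, where FEEDS is available)

# One entry per output file; the file names here are hypothetical.
FEEDS = {
    "matches.json": {"format": "json", "encoding": "utf8"},
    "matches.csv": {"format": "csv"},
    "matches.xml": {"format": "xml"},
}
```

Alternatively, running `scrapy crawl life_matches -o matches.csv` should pick the format from the file extension without touching the settings.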