I have created a spider that checks a particular movie booking site to see whether booking for a film has opened. It checks every 10 seconds. The problem is that even after booking opens on the website, my code doesn't get the updated page and keeps using the old scraped data.

For example:

I scraped the site at 8 AM, when film 'A' was not yet open for booking. Booking for film 'A' opened at 12 PM, but the spider still reports that it is not open. Note that I'm using an infinite while loop, so the program has been running continuously since 8 AM.

Code:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request
import threading
import time
import datetime
import winsound

class NewFilmSpiderSpider(scrapy.Spider):
    name = 'new_film_spider'
    allowed_domains = ['www.spicinemas.in']
    start_urls = ['https://www.spicinemas.in/coimbatore/now-showing']

    def parse(self, response):
        t = threading.Thread(self.getDetails(response))
        t.start()

    def getDetails(self, response):
        while True:
            records = response.xpath('//section[@class="main-section"]/section[2]/section[@class="movie__listing now-showing"]/ul/li/div/dl/dt/a/text()').extract()
            if 'NGK' in str(records):
                try:
                    print("Booking Opened",datetime.datetime.now())
                    winsound.PlaySound('alert.wav', winsound.SND_FILENAME)
                except Exception:
                    print ("Error: unable to play sound")
            else:
                print("Booking Not Opened",datetime.datetime.now())
            time.sleep(10)

If you run the code now, it says booking is open, but I need the webpage to be scraped again on every iteration of the while loop. How can I do that?

Update #1:

I'm getting this traceback when running the solution given below:

File "C:\Users\ranji\Documents\Spiders\SpiCinemasSpider\spicinemas_spider\spiders\new_film_spider.py", line 34, in <module>
    main()
  File "C:\Users\ranji\Documents\Spiders\SpiCinemasSpider\spicinemas_spider\spiders\new_film_spider.py", line 30, in main
    process.start()
  File "C:\Users\ranji\AppData\Local\Programs\Python\Python37-32\lib\site-packages\scrapy\crawler.py", line 293, in start
    reactor.run(installSignalHandlers=False)  # blocking call
  File "C:\Users\ranji\AppData\Local\Programs\Python\Python37-32\lib\site-packages\twisted\internet\base.py", line 1271, in run
    self.startRunning(installSignalHandlers=installSignalHandlers)
  File "C:\Users\ranji\AppData\Local\Programs\Python\Python37-32\lib\site-packages\twisted\internet\base.py", line 1251, in startRunning
    ReactorBase.startRunning(self)
  File "C:\Users\ranji\AppData\Local\Programs\Python\Python37-32\lib\site-packages\twisted\internet\base.py", line 754, in startRunning
    raise error.ReactorNotRestartable()
twisted.internet.error.ReactorNotRestartable
Ranjith Varatharajan
  • The issue is that you never request an updated version of the "response" data. Instead you keep extracting an XPath value from the same "response" and expect it to change, which it never will once the page has been fetched. You need to find a way to make Scrapy crawl the host again. – thuyein Jun 03 '19 at 02:34
  • But how can I call it again? I never used start_urls anywhere in the code. – Ranjith Varatharajan Jun 03 '19 at 02:36
  • You need a main function outside the class. Have it create a crawler runner instance (`from scrapy.crawler import CrawlerRunner`), then run it every 10 seconds. – thuyein Jun 03 '19 at 02:39
  • FYI it's __scraped__ (and __scrape__, __scraping__, __scraper__) not scrapped – DisappointedByUnaccountableMod Jun 07 '21 at 16:09

1 Answer

The problem is that the thread keeps working on the same "response" object on every iteration and expects it to change. A response is a snapshot of the page at the moment it was downloaded, so the page has to be requested again to see any update. The following modified code shows how to crawl the page every 10 seconds and check the XPath value.

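# -*- coding: utf-8 -*-
import datetime
import winsound  # Windows-only standard library module

import scrapy
from scrapy.crawler import CrawlerRunner
from twisted.internet import reactor, task


class NewFilmSpiderSpider(scrapy.Spider):
    name = 'new_film_spider'
    allowed_domains = ['www.spicinemas.in']
    start_urls = ['https://www.spicinemas.in/coimbatore/now-showing']

    def parse(self, response):
        # Every crawl downloads the page again, so this response is always current.
        records = response.xpath('//section[@class="main-section"]/section[2]/section[@class="movie__listing now-showing"]/ul/li/div/dl/dt/a/text()').extract()
        if 'NGK' in str(records):
            print("Booking Opened", datetime.datetime.now())
            try:
                winsound.PlaySound('alert.wav', winsound.SND_FILENAME)
            except Exception:
                print("Error: unable to play sound")
        else:
            print("Booking Not Opened", datetime.datetime.now())


def main():
    # CrawlerProcess.start() runs the Twisted reactor, and a reactor cannot be
    # started a second time (ReactorNotRestartable). Start the reactor once and
    # schedule the repeated crawls inside it with CrawlerRunner instead.
    # Note: newer Scrapy releases may also need the TWISTED_REACTOR setting to
    # match the reactor imported above.
    runner = CrawlerRunner()

    # runner.crawl() returns a Deferred; LoopingCall waits for it to fire
    # before scheduling the next call, so crawls never overlap.
    loop = task.LoopingCall(runner.crawl, NewFilmSpiderSpider)
    loop.start(10)  # crawl immediately, then roughly every 10 seconds

    reactor.run()  # blocks until interrupted (Ctrl+C)


if __name__ == "__main__":
    main()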
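
The underlying rule is independent of Scrapy: whatever fetches the page has to run inside the loop, not once before it. Here is a minimal standard-library sketch of that polling pattern. `wait_for_title`, `fetch_titles` and the fake page data are hypothetical names used only for illustration; in a real script `fetch_titles` would download and parse the page.

```python
import time
from typing import Callable, Iterable, Optional


def wait_for_title(
    fetch_titles: Callable[[], Iterable[str]],
    wanted: str,
    interval: float = 10.0,
    max_checks: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll until `wanted` shows up in the fetched titles.

    Returns the number of fetches that were needed, or -1 if `max_checks`
    fetches were made without finding it.
    """
    checks = 0
    while max_checks is None or checks < max_checks:
        checks += 1
        # Fetch inside the loop: every iteration sees fresh data.
        titles = list(fetch_titles())
        if any(wanted in title for title in titles):
            return checks
        sleep(interval)
    return -1


if __name__ == "__main__":
    # Hypothetical stand-in for the website: the film appears on the third fetch.
    pages = iter([["Film A"], ["Film A"], ["Film A", "NGK"]])
    print(wait_for_title(lambda: next(pages), "NGK", sleep=lambda s: None))  # prints 3
```

The question's loop is the opposite case: the data is fetched once and only the check is repeated, so the result can never change.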

Reference: https://doc.scrapy.org/en/latest/topics/practices.html , https://stackoverflow.com/a/43480164/1509809
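
A side note on the threading in the question: `threading.Thread(self.getDetails(response))` calls `getDetails` immediately in the calling thread and only then would pass its return value to `Thread`. Because `getDetails` loops forever, no thread is ever created, and `time.sleep` blocks Scrapy's event loop. The sketch below shows the usual `target=`/`args=` form; `check_once` and `run_in_background` are hypothetical helpers for illustration. Moving the loop into a real thread would still not fix the stale data, since the same `response` would be reused.

```python
import threading


def check_once(label, seen):
    # Record which thread actually executed this function.
    seen.append((label, threading.current_thread().name))


def run_in_background(func, *args):
    """Start func(*args) in a separate thread and return the Thread object."""
    # Pass the function itself via target= and its arguments via args=;
    # writing Thread(func(*args)) would call func here, in the current thread.
    worker = threading.Thread(target=func, args=args, name="checker", daemon=True)
    worker.start()
    return worker


if __name__ == "__main__":
    seen = []
    check_once("direct call", seen)  # runs in the calling thread
    run_in_background(check_once, "thread", seen).join()
    print(seen)
```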

thuyein