Scrapy taking old text from webpage

Question

I have created a spider that checks a movie booking site to see whether booking for a particular film has opened. It checks every 10 seconds. The problem I'm facing is that even after booking has opened on the website, my code doesn't fetch the updated page; it keeps using the old scraped data.

For example:

I scraped the site at 8 AM, when film 'A' was not yet open for booking. Booking for film 'A' opened at 12 PM, but the spider still reports that it is not open. Note that I'm using an infinite while loop, so I started the program at 8 AM and never stopped it.

Code:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request
import threading
import time
import datetime
import winsound

class NewFilmSpiderSpider(scrapy.Spider):
    name = 'new_film_spider'
    allowed_domains = ['www.spicinemas.in']
    start_urls = ['https://www.spicinemas.in/coimbatore/now-showing']

    def parse(self, response):
        t = threading.Thread(self.getDetails(response))
        t.start()

    def getDetails(self, response):
        while True:
            records = response.xpath('//section[@class="main-section"]/section[2]/section[@class="movie__listing now-showing"]/ul/li/div/dl/dt/a/text()').extract()
            if 'NGK' in str(records):
                try:
                    print("Booking Opened",datetime.datetime.now())
                    winsound.PlaySound('alert.wav', winsound.SND_FILENAME)
                except Exception:
                    print ("Error: unable to play sound")
            else:
                print("Booking Not Opened",datetime.datetime.now())
            time.sleep(10)

If you run the code now, it says booking is open. But I need the webpage to be scraped again on every iteration of the while loop. How can I do that?

Update #1:

I'm getting this traceback when running the solution given below:

  File "C:\Users\ranji\Documents\Spiders\SpiCinemasSpider\spicinemas_spider\spiders\new_film_spider.py", line 34, in <module>
    main()
  File "C:\Users\ranji\Documents\Spiders\SpiCinemasSpider\spicinemas_spider\spiders\new_film_spider.py", line 30, in main
    process.start()
  File "C:\Users\ranji\AppData\Local\Programs\Python\Python37-32\lib\site-packages\scrapy\crawler.py", line 293, in start
    reactor.run(installSignalHandlers=False)  # blocking call
  File "C:\Users\ranji\AppData\Local\Programs\Python\Python37-32\lib\site-packages\twisted\internet\base.py", line 1271, in run
    self.startRunning(installSignalHandlers=installSignalHandlers)
  File "C:\Users\ranji\AppData\Local\Programs\Python\Python37-32\lib\site-packages\twisted\internet\base.py", line 1251, in startRunning
    ReactorBase.startRunning(self)
  File "C:\Users\ranji\AppData\Local\Programs\Python\Python37-32\lib\site-packages\twisted\internet\base.py", line 754, in startRunning
    raise error.ReactorNotRestartable()
twisted.internet.error.ReactorNotRestartable

The issue is that you never request an updated version of the "response" data. Instead you keep extracting an XPath value from the same "response" and expect it to change, which it never will once the page has been "spidered". You need to find a way to make Scrapy spider the host again. — thuyein, Jun 03 '19 at 02:34
But how can I call it again? I never used start_urls anywhere in the code. — Ranjith Varatharajan, Jun 03 '19 at 02:36
You need to have a main function outside the class. Have it create a crawler runner instance (`from scrapy.crawler import CrawlerRunner`), then run the crawl every 10 seconds. — thuyein, Jun 03 '19 at 02:39
FYI it's __scraped__ (and __scrape__, __scraping__, __scraper__) not scrapped — DisappointedByUnaccountableMod, Jun 07 '21 at 16:09
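The point made in the first comment can be illustrated without Scrapy at all. The sketch below is only an analogy: `make_fake_site`, `poll_stale` and `poll_fresh` are hypothetical names, and the "site" is a stand-in function that starts listing the film on its third visit. Checking one downloaded snapshot repeatedly never sees the change, while downloading again on every check does.

```python
import itertools


def make_fake_site():
    """Stand-in for the website: lists film 'NGK' only from the third visit on."""
    visits = itertools.count(1)

    def fetch():
        return ["NGK"] if next(visits) >= 3 else []

    return fetch


def poll_stale(fetch, checks):
    # Downloaded once, like the single `response` passed to parse().
    snapshot = fetch()
    return ["NGK" in snapshot for _ in range(checks)]


def poll_fresh(fetch, checks):
    # Downloaded again on every check.
    return ["NGK" in fetch() for _ in range(checks)]


print(poll_stale(make_fake_site(), 4))  # [False, False, False, False]
print(poll_fresh(make_fake_site(), 4))  # [False, False, True, True]
```

The `while True` loop in the question behaves like `poll_stale`: it sleeps and re-reads the same object, so no new HTTP request is ever made.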

thuyein · Answer 1 · 2019-06-03T03:15:31.583

The issue is that the thread keeps working on the same "response" data every time and expects it to change. The modified code below shows how to crawl the page again every 10 seconds and check the XPath value each time.

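# -*- coding: utf-8 -*-
import datetime
import winsound  # Windows only

import scrapy
from scrapy.crawler import CrawlerProcess


class NewFilmSpiderSpider(scrapy.Spider):
    name = 'new_film_spider'
    allowed_domains = ['www.spicinemas.in']
    start_urls = ['https://www.spicinemas.in/coimbatore/now-showing']

    def parse(self, response):
        # Every crawl downloads the page again, so this response is fresh.
        records = response.xpath('//section[@class="main-section"]/section[2]/section[@class="movie__listing now-showing"]/ul/li/div/dl/dt/a/text()').extract()
        if 'NGK' in str(records):
            try:
                print("Booking Opened", datetime.datetime.now())
                winsound.PlaySound('alert.wav', winsound.SND_FILENAME)
            except Exception:
                print("Error: unable to play sound")
        else:
            print("Booking Not Opened", datetime.datetime.now())


def crawl_repeatedly(process, interval=10):
    # crawl() returns a Deferred that fires once this crawl has finished.
    deferred = process.crawl(NewFilmSpiderSpider)

    def schedule_next(_result):
        # Imported here so that Scrapy has already set up the Twisted reactor.
        from twisted.internet import reactor
        # Schedule the next crawl `interval` seconds after this one ends.
        reactor.callLater(interval, crawl_repeatedly, process, interval)

    deferred.addBoth(schedule_next)


def main():
    process = CrawlerProcess()
    crawl_repeatedly(process)
    # Start the reactor exactly once and keep it running between crawls;
    # a second start() call raises ReactorNotRestartable.
    process.start(stop_after_crawl=False)


if __name__ == "__main__":
    main()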

References: https://doc.scrapy.org/en/latest/topics/practices.html, https://stackoverflow.com/a/43480164/1509809

Could you please explain what `if __name__ == "__main__": main()` means? — Ranjith Varatharajan, Jun 03 '19 at 02:53
Umm, are you new to Python? Read about it [here](https://stackoverflow.com/questions/419163/what-does-if-name-main-do) — thuyein, Jun 03 '19 at 02:54
Yes, I've just started learning Python. When I run this, it runs only once and then stops. I ran it using `scrapy crawl new_film_spider`. — Ranjith Varatharajan, Jun 03 '19 at 02:56
With this modified version, you just need to run `python filename.py`. And you should have provided that information in the question. — thuyein, Jun 03 '19 at 02:57
Again I'm getting an error and it runs only once. I have added the stack trace to the updated question. — Ranjith Varatharajan, Jun 03 '19 at 03:02
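On the `if __name__ == "__main__"` question raised in the comments above, a small stand-alone illustration may help. The `main` function here is only a placeholder, not the one from the answer.

```python
def main():
    # Placeholder for the code that would start the crawl.
    return "crawl started"


if __name__ == "__main__":
    # True only when the file is run directly (python filename.py).
    # When the file is imported, __name__ holds the module name instead,
    # so main() is not called automatically.
    print(main())
```

This is why the answer's version is meant to be started with `python filename.py`: `scrapy crawl` imports the spider module rather than running it as a script, so the guarded `main()` call is skipped.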
