
First of all, I have been working with Python for a month, and I created an application. This app needs score results: it collects data for home, away and today's matches. I use Scrapy to collect this data without a problem!

Scrapy creates 3 JSON files: Home.json, Away.json and Today.json

import scrapy
from scrapy.crawler import CrawlerRunner
from twisted.internet import defer, reactor  # defer is needed for inlineCallbacks

class Home(scrapy.Spider):
    name = "home"
    # start_urls and parse() omitted

class Away(scrapy.Spider):
    name = "away"
    # start_urls and parse() omitted

class Today(scrapy.Spider):
    name = "today"
    # start_urls and parse() omitted

# One runner per spider, each writing its own JSON feed
runnerHome = CrawlerRunner(settings={
    "FEEDS": {"C:/Users/Messi/Home.json": {"format": "json", "overwrite": True}},
})
runnerAway = CrawlerRunner(settings={
    "FEEDS": {"C:/Users/Messi/Away.json": {"format": "json", "overwrite": True}},
})
runnerToday = CrawlerRunner(settings={
    "FEEDS": {"C:/Users/Messi/Today.json": {"format": "json", "overwrite": True}},
})

@defer.inlineCallbacks
def crawl():
    # Chain the deferreds so the spiders run strictly one after another
    yield runnerHome.crawl(Home)
    yield runnerAway.crawl(Away)
    yield runnerToday.crawl(Today)
    reactor.stop()

crawl()
reactor.run()

In this structure the spiders run sequentially by chaining the deferreds. The code block above works excellently, without any problem!

My second piece of code creates a single usable data file (data.json) from Home.json, Away.json and Today.json:

import json

def data():

    # Read the three raw Scrapy feed files
    with open("C:/Users/Messi/Home.json") as dosya:
        homeVeriler = json.load(dosya)

    with open("C:/Users/Messi/Away.json") as dosya:
        awayVeriler = json.load(dosya)

    with open("C:/Users/Messi/Today.json") as dosya:
        todayVeriler = json.load(dosya)

    # ... some calculations that build `veriler` ...

    # Write the combined result to data.json
    with open("C:/Users/Messi/data.json", "w") as dosya:
        json.dump(veriler, dosya)
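
To make data() fail loudly when Scrapy has skipped lines, I am thinking of validating each feed before the calculations. A minimal sketch (the load_feed helper name is mine; I assume an empty or missing feed means the scrape failed):

import json
import os

def load_feed(path):
    # Raise a clear error when a feed is missing or empty,
    # so a retry loop can detect a failed scrape
    if not os.path.exists(path):
        raise RuntimeError(f"feed was not written: {path}")
    with open(path) as f:
        items = json.load(f)
    if not items:
        raise RuntimeError(f"feed is empty: {path}")
    return items

data() would then start with homeVeriler = load_feed("C:/Users/Messi/Home.json") and so on.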

OK, so what do I want?

While Scrapy scrapes the pages, it sometimes skips some lines, so when I run data() manually (it lives in another .py file), I get an error. I need a loop with a schedule: if something goes wrong during scraping and data() raises an error, the scrape should be retried and data() run again. And do this at 12:05 am every day :D
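
Roughly, the schedule I have in mind looks like the sketch below, standard library only (run_pipeline is a hypothetical placeholder for "scrape, then run data()"):

import time
from datetime import datetime, timedelta

def seconds_until(hour, minute):
    # Seconds from now until the next occurrence of hour:minute
    now = datetime.now()
    target = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if target <= now:
        target += timedelta(days=1)
    return (target - now).total_seconds()

while True:
    time.sleep(seconds_until(0, 5))   # wait until 12:05 am
    for attempt in range(5):          # retry a few times on failure
        try:
            run_pipeline()            # hypothetical: scrape, then run data()
            break
        except Exception as exc:
            print(f"attempt {attempt + 1} failed: {exc}")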

First the Home.json file is created, then Away.json, and lastly Today.json. To test, I changed a line in Home.json, and Scrapy did not scrape Home.json again. There is a problem in the while loop: yenile() does not start Scrapy again.

@defer.inlineCallbacks
def crawl():
    yield runnerHome.crawl(Home)
    yield runnerAway.crawl(Away)
    yield runnerToday.crawl(Today)
    reactor.stop()

def yenile():
    crawl()
    reactor.run()   # works once; a second call fails because the reactor cannot be restarted

while True:
    try:
        yenile()
        data()
        break
    except:
        pass
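
From what I have read, the Twisted reactor cannot be restarted once it has stopped, so maybe the loop should start Scrapy in a fresh child process on every attempt instead. A minimal sketch, assuming the first code block above is saved as a standalone script (crawl_all.py is a hypothetical name):

import subprocess
import sys

def yenile():
    # A fresh process gets a fresh Twisted reactor, so repeated runs work
    result = subprocess.run([sys.executable, "crawl_all.py"])
    if result.returncode != 0:
        raise RuntimeError("scraping process failed")

for attempt in range(5):
    try:
        yenile()
        data()
        break
    except Exception:
        continue   # scrape or data() failed; try again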

What should the correct loop structure be?

Thanks very much. I love Stack Overflow

  • You cannot start a new twisted reactor from within the same process once the crawl has already finished. You would be better off fixing the spider so that it doesn't skip lines, or switch to using subprocess calls for running each of your spiders. – Alexander Oct 14 '22 at 17:38
  • I did some research: https://stackoverflow.com/questions/44228851/scrapy-on-a-schedule/44230214#44230214 I know this method, but then I would need 2 VPSes: one scrapes, the other compiles. :/ –  Oct 14 '22 at 18:23
  • Yes, I understood your question... I am telling you, just like the linked answer says, that you cannot restart the reactor – Alexander Oct 14 '22 at 18:26

0 Answers