
According to this question, How Scrapy filters the crawled urls?, there is a file called `requests.seen` in the directory defined by the `JOBDIR` variable.

Where can I find the `JOBDIR` variable?

– Marco Dinatsoli

1 Answer


According to the official documentation (Jobs: pausing and resuming crawls), `JOBDIR` can be set from the command line:

scrapy crawl somespider -s JOBDIR=crawls/somespider-1
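Since `JOBDIR` is an ordinary Scrapy setting, it can also be set in the project's `settings.py` instead of being passed with `-s` every time; a minimal sketch (the path `crawls/somespider-1` is just the example value from above):

```python
# settings.py -- JOBDIR is a regular Scrapy setting, so it can live here
# instead of being supplied with -s on the command line.
JOBDIR = 'crawls/somespider-1'
```

A per-spider alternative is the spider's `custom_settings` dict, e.g. `custom_settings = {'JOBDIR': 'crawls/somespider-1'}`.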
– ndpu
  • I ran my spider, and yes, the file was generated, but when I opened it I didn't find the scraped URLs. Instead, I found lines like this: `f6b696ffa8fbcd8fbd4eff777ba677091858a9c7`. Why? – Marco Dinatsoli Jan 27 '14 at 16:16
  • Is that the fingerprint of a scraped URL? – Marco Dinatsoli Jan 27 '14 at 16:17
  • @MarcoDinatsoli in this directory Scrapy stores all the data required to keep the state of a single job (i.e. a spider run): counters, offsets, etc., but not lists of scraped URLs. – ndpu Jan 27 '14 at 16:28
  • What I am looking for is the scraped list of URLs. Where can I find it? I have a sense that this file contains it. – Marco Dinatsoli Jan 27 '14 at 16:34
  • @MarcoDinatsoli look here http://stackoverflow.com/questions/3871613/scrapy-how-to-identify-already-scraped-urls or similar questions. – ndpu Jan 27 '14 at 18:17
  • How could I set the JOBDIR from a script? – William Kinaan Mar 02 '14 at 19:32
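On the lines seen in `requests.seen`: each line is a request fingerprint, a SHA1 hex digest, not a URL. The exact canonicalization is internal to Scrapy, but a simplified sketch (the function name `simplified_fingerprint` is made up for illustration; Scrapy's real fingerprint also canonicalizes the URL and can include headers and body) shows why the file looks like the 40-character hex string quoted above:

```python
import hashlib

def simplified_fingerprint(method, url):
    # Simplified sketch only: reduce a request to a SHA1 hex digest,
    # one such digest per line in requests.seen. Scrapy's actual
    # fingerprinting canonicalizes the URL first.
    h = hashlib.sha1()
    h.update(method.encode())
    h.update(url.encode())
    return h.hexdigest()

fp = simplified_fingerprint("GET", "http://example.com/page")
print(fp)       # a 40-character hex string, same shape as requests.seen lines
print(len(fp))  # 40
```

Because only digests are stored, the original URLs cannot be recovered from the file; it exists so the scheduler can skip requests it has already seen.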