Server
- 6 GB RAM
- 4 cores, Intel Xeon 2.60 GHz
- CONCURRENT_REQUESTS = 32
- 1M URLs in a CSV file
- 700 Mbit/s downstream
- 96% memory consumption
With debug mode on, the crawl stops after around 400,000 URLs, most likely because the server runs out of memory. Without debug mode it takes up to 5 days, which is pretty slow imo, and it still uses way too much memory (96%).
Any hints are highly welcome :) This is the spider:
import scrapy
import csv


def get_urls_from_csv():
    # Reads the whole CSV up front and returns every URL as one list.
    with open('data.csv', newline='') as csv_file:
        data = csv.reader(csv_file, delimiter=',')
        scrapurls = []
        for row in data:
            scrapurls.append("http://" + row[2])
    return scrapurls


class rssitem(scrapy.Item):
    sourceurl = scrapy.Field()
    rssurl = scrapy.Field()


class RssparserSpider(scrapy.Spider):
    name = "rssspider"
    allowed_domains = ["*"]
    start_urls = ()

    def start_requests(self):
        return [scrapy.http.Request(url=start_url) for start_url in get_urls_from_csv()]

    def parse(self, response):
        # Extract every RSS <link> reference from the page.
        res = response.xpath('//link[@type="application/rss+xml"]/@href')
        for sel in res:
            item = rssitem()
            item['sourceurl'] = response.url
            item['rssurl'] = sel.extract()
            yield item
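For comparison, here is a minimal sketch of one direction I'm considering: streaming the URLs from the CSV inside start_requests instead of building the full list up front. It assumes the same data.csv layout (URL in the third column) and yields plain dicts instead of the rssitem class; whether this alone keeps memory in check on 6 GB is an assumption on my part, not something I have measured.

import csv

import scrapy


class LazyRssSpider(scrapy.Spider):
    # Hypothetical variant of the spider above: the CSV is read row by row,
    # so the 1M URLs are never held in a Python list at the same time.
    name = "rssspider_lazy"

    def start_requests(self):
        # Scrapy consumes this generator lazily as scheduler capacity allows.
        with open('data.csv', newline='') as csv_file:
            for row in csv.reader(csv_file, delimiter=','):
                yield scrapy.Request(url="http://" + row[2], callback=self.parse)

    def parse(self, response):
        # Same extraction logic as above, emitted as plain dicts.
        for sel in response.xpath('//link[@type="application/rss+xml"]/@href'):
            yield {'sourceurl': response.url, 'rssurl': sel.extract()}

The class name and the dict keys are just placeholders; the original rssitem items would work the same way here.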