
Server

  • 6 GB RAM
  • 4 Cores Intel Xeon 2.60GHz
  • 32 CONCURRENT_REQUESTS
  • 1m URLs in CSV
  • 700 Mbit/s downstream
  • 96% Memory Consumption

With debug mode on, the scrape stops after around 400,000 URLs, most likely because the server runs out of memory. Without debug mode it takes up to 5 days, which is pretty slow imo, and it still uses way too much memory (96%).

Any hints are highly welcome :)

import scrapy
import csv

def get_urls_from_csv():
    with open('data.csv', newline='') as csv_file:
        data = csv.reader(csv_file, delimiter=',')
        scrapurls = []
        for row in data:
            scrapurls.append("http://"+row[2])
        return scrapurls

class rssitem(scrapy.Item):
    sourceurl = scrapy.Field()
    rssurl = scrapy.Field()


class RssparserSpider(scrapy.Spider):
    name = "rssspider"
    allowed_domains = ["*"]
    start_urls = ()

    def start_requests(self):
        return [scrapy.http.Request(url=start_url) for start_url in get_urls_from_csv()]

    def parse(self, response):
        res = response.xpath('//link[@type="application/rss+xml"]/@href')
        for sel in res:
            item = rssitem()
            item['sourceurl']=response.url
            item['rssurl']=sel.extract()
            yield item

  • Well first off you could use a generator to yield each url so you don't store 1 million together, `yield "http://"+row[2]` – Padraic Cunningham Aug 26 '16 at 19:14
  • I agree. The first thing I'd suggest knocking out is storing all the URLs in memory. I'd process them in batch, pulling a few of them at a time. In terms of speed, you can probably use the multiprocessing library and spread the work across processes. – Anfernee Aug 26 '16 at 19:21
  • Also `return (scrapy.http.Request(url=start_url, dont_filter=1) for start_url in get_urls_from_csv())` which again will lazily evaluate and not create a list of 1 million Request objects – Padraic Cunningham Aug 26 '16 at 19:22
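A minimal sketch of the batch/multi-process idea from the comments, assuming it is acceptable to split data.csv up front and run one independent Scrapy process per chunk; the chunk size, the part_N.csv file names, the spider.py file name and the csv_file spider argument (which the spider would have to read instead of the hard-coded 'data.csv') are illustrative assumptions, not part of the original code:

import csv
import subprocess

CHUNK_SIZE = 250000  # assumption: roughly four chunks for 1m rows


def split_csv(path='data.csv'):
    # write part_0.csv, part_1.csv, ... with CHUNK_SIZE rows each
    parts, writer, out = 0, None, None
    with open(path, newline='') as csv_file:
        for i, row in enumerate(csv.reader(csv_file)):
            if i % CHUNK_SIZE == 0:
                if out:
                    out.close()
                out = open('part_{}.csv'.format(parts), 'w', newline='')
                writer = csv.writer(out)
                parts += 1
            writer.writerow(row)
    if out:
        out.close()
    return parts


if __name__ == '__main__':
    n = split_csv()
    # one Scrapy process per chunk; -a sets the (hypothetical) csv_file
    # attribute on the spider, -o collects that chunk's items
    procs = [subprocess.Popen(['scrapy', 'runspider', 'spider.py',
                               '-a', 'csv_file=part_{}.csv'.format(i),
                               '-o', 'feeds_{}.csv'.format(i)])
             for i in range(n)]
    for p in procs:
        p.wait()

Each process then only ever holds its own chunk of URLs and in-flight responses, which keeps the per-process memory footprint bounded.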

2 Answers


As I commented, you should use generators to avoid building whole lists of objects in memory (see what-does-the-yield-keyword-do-in-python). With a generator, objects are produced lazily, so you never hold a large list of them in memory at once:

import csv

import scrapy


def get_urls_from_csv():
    with open('data.csv', newline='') as csv_file:
        data = csv.reader(csv_file, delimiter=',')
        for row in data:
            yield "http://" + row[2]  # yield each url lazily


class rssitem(scrapy.Item):
    sourceurl = scrapy.Field()
    rssurl = scrapy.Field()


class RssparserSpider(scrapy.Spider):
    name = "rssspider"
    allowed_domains = ["*"]
    start_urls = ()

    def start_requests(self):
        # return a generator expression.
        return (scrapy.http.Request(url=start_url) for start_url in get_urls_from_csv())

    def parse(self, response):
        res = response.xpath('//link[@type="application/rss+xml"]/@href')
        for sel in res:
            item = rssitem()
            item['sourceurl']=response.url
            item['rssurl']=sel.extract()
            yield item

As far as performance goes, what the docs on Broad Crawls suggest is to increase concurrency:

Concurrency is the number of requests that are processed in parallel. There is a global limit and a per-domain limit. The default global concurrency limit in Scrapy is not suitable for crawling many different domains in parallel, so you will want to increase it. How much to increase it will depend on how much CPU your crawler will have available. A good starting point is 100, but the best way to find out is by doing some trials and identifying at what concurrency your Scrapy process gets CPU bounded. For optimum performance, you should pick a concurrency where CPU usage is at 80-90%.

To increase the global concurrency use:

CONCURRENT_REQUESTS = 100

emphasis mine.

Also, increase the Twisted IO thread pool maximum size:

Currently Scrapy does DNS resolution in a blocking way with the usage of a thread pool. With higher concurrency levels the crawling could be slow or even fail, hitting DNS resolver timeouts. A possible solution is to increase the number of threads handling DNS queries. The DNS queue will be processed faster, speeding up the establishing of connections and crawling overall.

To increase maximum thread pool size use:

REACTOR_THREADPOOL_MAXSIZE = 20
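
Putting the pieces together, a broad-crawl settings.py might look like the sketch below. CONCURRENT_REQUESTS and REACTOR_THREADPOOL_MAXSIZE are the two values quoted above; the remaining settings are further suggestions from the same Broad Crawls docs page (reduce log verbosity, disable cookies and retries, shorten the download timeout) and should be treated as starting points to benchmark, not guarantees:

# settings.py -- sketch of a broad-crawl configuration
CONCURRENT_REQUESTS = 100           # global limit; raise until CPU sits around 80-90%
CONCURRENT_REQUESTS_PER_DOMAIN = 8  # per-domain limit (Scrapy's default; lower it if single hosts struggle)
REACTOR_THREADPOOL_MAXSIZE = 20     # more threads for Scrapy's blocking DNS lookups

LOG_LEVEL = 'INFO'       # DEBUG logging gets expensive over a million requests
COOKIES_ENABLED = False  # cookies are rarely needed in a broad crawl
RETRY_ENABLED = False    # skip retries instead of queueing failed pages again
DOWNLOAD_TIMEOUT = 15    # give up on slow hosts early

For quick trials, any of these can also be overridden on the command line, e.g. scrapy crawl rssspider -s CONCURRENT_REQUESTS=100.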
– Padraic Cunningham
import csv
from collections import namedtuple

import scrapy


def get_urls_from_csv():
    with open('data.csv', newline='') as csv_file:
        data = csv.reader(csv_file, delimiter=',')
        for row in data:
            yield row[2]


# if you can use something other than scrapy
rssitem = namedtuple('rssitem', 'sourceurl rssurl')


class RssparserSpider(scrapy.Spider):
    name = "rssspider"
    allowed_domains = ["*"]
    start_urls = ()

    def start_requests(self):  # remember that this returns a generator
        for start_url in get_urls_from_csv():
            yield scrapy.http.Request(url="http://{}".format(start_url))

    def parse(self, response):
        res = response.xpath('//link[@type="application/rss+xml"]/@href')
        for sel in res:
            yield rssitem(response.url, sel.extract())
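
A side note on the namedtuple: if you stay within Scrapy, its item pipeline and feed exports also accept plain dicts (since Scrapy 1.0), so a minimal variant of the parse method above could simply yield dicts instead of a custom item type:

    def parse(self, response):
        res = response.xpath('//link[@type="application/rss+xml"]/@href')
        for sel in res:
            # plain dicts are accepted as items by Scrapy 1.0+
            yield {'sourceurl': response.url, 'rssurl': sel.extract()}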
– turkus