
My problem is this: I want to extract all of the valuable text from a given domain, for example www.example.com. To do that I go to the website, visit all of its links up to a maximum depth of 2, and write the result to a CSV file.
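
To give an idea of what I mean, the relevant Scrapy settings look roughly like this (illustrative values, not my exact configuration):

# illustrative only -- not my exact configuration
SKETCH_SETTINGS = {
    'DEPTH_LIMIT': 2,        # follow links at most two levels deep from the start page
    'FEED_FORMAT': 'csv',    # use the built-in feed exporter to write items as CSV
    'FEED_URI': 'output.csv',
}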

I wrote a Scrapy module that solves this with a single process yielding multiple crawlers, but it is inefficient: I can crawl roughly 1k domains / 5k pages per hour, and as far as I can tell my bottleneck is the CPU (because of the GIL?). After leaving my PC running for a while I also found that my network connection had dropped.

When I tried to use multiple processes I just got an error from Twisted (see Multiprocessing of Scrapy Spiders in Parallel Processes). So this means I would have to learn Twisted, which feels dated compared to asyncio, but that is only my opinion.

So I have a couple of ideas about what to do:

  • Fight back, learn Twisted, and implement multiprocessing with a distributed queue backed by Redis, although I don't feel that Scrapy is the right tool for this type of job (a rough sketch of what I mean is shown after this list).
  • Go with pyspider, which has all the features I need (but I have never used it).
  • Go with Nutch, which is rather complex (I have never used it either).
  • Build my own distributed crawler. After crawling only 4 websites I had already hit edge cases (SSL, duplicates, timeouts), but it would make it easy to add features such as focused crawling later.
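
To illustrate the first option, here is a very rough sketch of a Redis-backed work queue (the key name, the timeout and crawl_one() are made up for the illustration):

import redis  # pip install redis

r = redis.StrictRedis(host='localhost', port=6379)

def crawl_one(url):
    # placeholder for the real per-URL crawl (e.g. launching a spider for that domain)
    print('would crawl', url)

def enqueue(urls):
    # producer side: push target URLs onto a shared Redis list
    for url in urls:
        r.lpush('crawler:queue', url)

def worker():
    # consumer side: every worker process/machine pops URLs and crawls them
    while True:
        item = r.brpop('crawler:queue', timeout=30)
        if item is None:            # nothing arrived for 30 s -> assume the queue is drained
            break
        _, url = item
        crawl_one(url.decode('utf8'))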

What solution do you recommend?

Edit 1: sharing the code

import html2text
import langdetect
# OrderedSet, regex (a pre-compiled pattern) and comprehension_helper are
# small helpers defined elsewhere in the project

class ESIndexingPipeline(object):
    def __init__(self):
        self.extracted_type = []
        self.text = OrderedSet()
        self.h = html2text.HTML2Text()   # converts HTML to markdown-like plain text
        self.h.ignore_links = True
        self.h.images_to_alt = True

    def process_item(self, item, spider):
        # convert the raw HTML body to plain text and merge consecutive
        # non-empty lines back into single paragraphs
        body = item['body']
        body = self.h.handle(str(body, 'utf8')).split('\n')

        first_line = True
        for piece in body:
            piece = piece.strip(' \n\t\r')
            if len(piece) == 0:
                first_line = True
            else:
                e = ''
                # `regex` is a pattern compiled elsewhere in the project
                if not self.text.empty() and not first_line and not regex.match(piece):
                    e = self.text.pop() + ' '
                e += piece
                self.text.add(e)
                first_line = False

        return item

    def open_spider(self, spider):
        self.target_id = spider.target_id
        self.queue = spider.queue

    def close_spider(self, spider):
        # keep only the paragraphs that langdetect classifies as English
        self.text = [e for e in self.text if comprehension_helper(langdetect.detect, e) == 'en']
        if spider.write_to_file:
            self._write_to_file(spider)

    def _write_to_file(self, spider):
        concat = "\n".join(self.text)
        self.queue.put([self.target_id, concat])
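
For reference, OrderedSet and comprehension_helper come from elsewhere in my project; roughly equivalent stand-ins (not my exact code) would be:

from collections import OrderedDict

class OrderedSet:
    # minimal stand-in: keeps insertion order and supports the calls used above
    def __init__(self):
        self._items = OrderedDict()

    def add(self, item):
        self._items[item] = None

    def pop(self):
        # remove and return the most recently added element
        return self._items.popitem(last=True)[0]

    def empty(self):
        return len(self._items) == 0

    def __iter__(self):
        return iter(self._items)

def comprehension_helper(func, arg):
    # call func(arg) inside a comprehension, swallowing exceptions
    # (langdetect raises LangDetectException on very short or empty input)
    try:
        return func(arg)
    except Exception:
        return None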

And the call:

from multiprocessing import Process
from twisted.internet import defer, reactor
from scrapy.crawler import CrawlerProcess

def execute_crawler_process(targets, write_to_file=True, settings=None, parallel=800, queue=None):
    if settings is None:
        settings = DEFAULT_SPIDER_SETTINGS

    # NOTE: each batch of `parallel` crawlers runs, then we wait for it to
    # finish before starting the next batch (batches run sequentially)
    @defer.inlineCallbacks
    def crawl(runner):
        n_crawlers_batch = 0
        done = 0
        n = float(len(targets))
        for url in targets:
            #print("target: ", url)
            n_crawlers_batch += 1
            r = runner.crawl(
                TextExtractionSpider,
                url=url,
                target_id=url,
                write_to_file=write_to_file,
                queue=queue)
            if n_crawlers_batch == parallel:
                print('joining')
                done += n_crawlers_batch        # count the batch before resetting it
                n_crawlers_batch = 0
                d = runner.join()
                print('{}/{} targets done'.format(done, int(n)))
                yield d  # wait until the whole batch has finished downloading
        if n_crawlers_batch > 0:                # flush the last, partially filled batch
            d = runner.join()
            done += n_crawlers_batch
            yield d

        reactor.stop()

    def f():
        # run the whole crawl (and the Twisted reactor) in a separate process
        runner = CrawlerProcess(settings)
        crawl(runner)
        reactor.run()

    p = Process(target=f)
    p.start()

The spider itself is not particularly interesting.
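
For completeness, it essentially just follows links within the target domain up to depth 2 and hands the raw body to the pipeline; a rough sketch (not the exact code) looks like this:

import scrapy
from urllib.parse import urlparse

class TextExtractionSpider(scrapy.Spider):
    # rough sketch only, not the exact spider
    name = 'text_extraction'
    custom_settings = {'DEPTH_LIMIT': 2}   # visit links at most two levels deep

    def __init__(self, url=None, target_id=None, write_to_file=True,
                 queue=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.start_urls = [url]
        self.allowed_domains = [urlparse(url).netloc]   # stay on the target domain
        self.target_id = target_id
        self.write_to_file = write_to_file
        self.queue = queue

    def parse(self, response):
        # the pipeline consumes item['body']
        yield {'body': response.body}
        for href in response.css('a::attr(href)').extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse)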

  • It's strange that your bottleneck is the CPU. I'd expect that while scraping you spend most of your time on I/O, downloading web pages. Parsing HTML shouldn't be very hard, and you can't really make 10,000 requests at once; your computer won't let you create that many sockets. It would help if you showed the code of your Scrapy project. – Loïc Faure-Lacroix Dec 21 '16 at 12:26
  • 1k domains/h ≈ 1 domain per 3.6 s, so your bottleneck is most likely NOT the CPU but simply your internet connection. Don't underestimate the speed of your CPU and don't overestimate the speed of your internet connection. You can check whether it really is the CPU by looking at your CPU usage; I really doubt it will be at 99%. 5k websites an hour sounds like a reasonable number to me. – Gewure Dec 21 '16 at 12:26
  • @Gewure It's more likely that his ISP thinks something is fishy with that computer and cuts the connection (to prevent spam bots or as DDoS protection, for example). – Loïc Faure-Lacroix Dec 21 '16 at 12:29
  • @LoïcFaure-Lacroix As I said, the internet connection ;) – Gewure Dec 21 '16 at 12:35
  • Yeah, I could be completely wrong. I'm parsing HTML to markdown and doing language detection in the Scrapy pipeline. I will do some profiling and try to update my post. – sacherus Dec 21 '16 at 13:14
  • I had one thread at 100% usage the whole time, with the internet connection at 1 MB/s before it dropped to 100 kB/s (my maximum speed is 4 MB/s). That is why I inferred that I'm using too much CPU. – sacherus Dec 21 '16 at 13:33

1 Answer


You can use Scrapy-Redis. It is basically a Scrapy spider that fetches the URLs to crawl from a queue in Redis. The advantage is that you can start many concurrent spiders, so you can crawl faster. All the instances of the spider pull URLs from the queue and sit idle when they run out of URLs to crawl. The Scrapy-Redis repository comes with an example project that implements this.

I use Scrapy-Redis to fire up 64 instances of my crawler to scrape 1 million URLs in around 1 hour.
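
A minimal sketch of the pieces involved (simplified; the spider name and Redis key are just examples, see the scrapy-redis example project for the real thing):

# settings.py -- route scheduling and duplicate filtering through Redis
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"  # dedupe across all spider instances
SCHEDULER_PERSIST = True                                    # keep the queue between runs
REDIS_URL = 'redis://localhost:6379'

# spiders/text_spider.py
from scrapy_redis.spiders import RedisSpider

class DistributedTextSpider(RedisSpider):
    # every running instance pops its start URLs from this Redis key
    name = 'text_spider'
    redis_key = 'text_spider:start_urls'

    def parse(self, response):
        yield {'url': response.url, 'body': response.body}

Then start as many scrapy crawl text_spider processes as you want (on one or several machines) and feed them with redis-cli lpush text_spider:start_urls http://www.example.com.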

– Carlos Peña