5

I have a list of more than 100,000 URLs (across different domains) that I want to download and save to a database for further processing and tinkering.

Would it be wise to use Scrapy instead of Python's multiprocessing / multithreading? If so, how do I write a standalone script that does this?

Also, feel free to suggest other awesome approaches that come to mind.

Anuvrat Parashar

4 Answers

2

Scrapy does not seem relevant here, since you already know exactly which URLs to fetch (there is no crawling involved).

The easiest way that comes to mind would be to use Requests. However, querying each URL in sequence and blocking while waiting for each answer wouldn't be efficient, so you could consider GRequests to send batches of requests asynchronously.
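A minimal sketch of that idea, assuming the URLs sit in a urls.txt file (one per line) and the pages go into a local SQLite database - the file and table names are just placeholders:

    import grequests  # import first so gevent's monkey patching happens early
    import sqlite3

    # Assumed input: one URL per line in urls.txt (placeholder name).
    urls = [line.strip() for line in open("urls.txt") if line.strip()]

    conn = sqlite3.connect("pages.db")  # placeholder database name
    conn.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, body BLOB)")

    # Build unsent requests lazily, then send them with a bounded pool size
    # so we don't open 100,000 connections at once.
    pending = (grequests.get(u, timeout=10) for u in urls)
    for response in grequests.imap(pending, size=20):
        if response is not None and response.ok:
            conn.execute("INSERT OR REPLACE INTO pages VALUES (?, ?)",
                         (response.url, response.content))
            conn.commit()

    conn.close()

GRequests is built on gevent, so the size argument bounds how many requests are in flight at any moment.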

icecrime
  • It's one of those things you can't imagine living without once you've come across it. Thanks a ton for introducing me to GRequests. – Anuvrat Parashar Jun 06 '13 at 09:20
0

Most site owners will try to block your crawler if you suddenly create a high load.

So even if you have a fixed list of links, you still need to handle timeouts, HTTP status codes, proxies, etc., which Scrapy or Grab can do for you.
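As a rough illustration (my values are placeholders, not a recommendation), these are the kinds of Scrapy settings that cover those concerns; drop them into settings.py or pass them to a CrawlerProcess:

    # Illustrative Scrapy settings for timeouts, retries on bad HTTP status
    # codes, throttling and (optionally) proxies. Values are placeholders.
    SCRAPY_SETTINGS = {
        "DOWNLOAD_TIMEOUT": 15,                 # per-request timeout, in seconds
        "RETRY_ENABLED": True,
        "RETRY_TIMES": 2,                       # retry each failed request up to twice
        "RETRY_HTTP_CODES": [429, 500, 502, 503, 504],
        "AUTOTHROTTLE_ENABLED": True,           # back off automatically under load
        "CONCURRENT_REQUESTS_PER_DOMAIN": 4,    # be polite to each domain
        # "HTTPPROXY_ENABLED": True,            # keep the proxy middleware on if you route through proxies
    }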

b1_
0

Scrapy is still an option.

  1. Speed/performance/efficiency

    Scrapy is written with Twisted, a popular event-driven networking framework for Python. Thus, it is implemented using non-blocking (aka asynchronous) code for concurrency.

  2. Database pipelining

    You mentioned that you want your data to end up in a database - as you may know, Scrapy has an Item Pipeline feature:

    After an item has been scraped by a spider, it is sent to the Item Pipeline which processes it through several components that are executed sequentially.

    So, each page can be written to the database immediately after it has been downloaded (see the sketch at the end of this answer).

  3. Code organization

    Scrapy offers you a nice and clear project structure, where settings, spiders, items, pipelines, etc. are separated logically. That alone makes your code clearer and easier to understand and maintain.

  4. Time to code

    Scrapy does a lot of work for you behind the scenes. This lets you focus on the actual code and logic, rather than on the "metal" part: creating processes, threads, etc.

But, at the same time, Scrapy might be overkill. Remember that Scrapy was designed for (and is great at) crawling and scraping data out of web pages. If you just want to download a bunch of pages without looking into them, then yes, grequests is a good alternative.
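For completeness, here is a rough sketch of what a standalone script could look like with a recent Scrapy version: a spider fed from a fixed URL list, a minimal SQLite item pipeline, and CrawlerProcess so no project scaffolding is needed. The file names, table name and concurrency value are arbitrary placeholders.

    import sqlite3

    import scrapy
    from scrapy.crawler import CrawlerProcess

    class SQLitePipeline:
        """Minimal item pipeline: write each downloaded page straight to SQLite."""

        def open_spider(self, spider):
            self.conn = sqlite3.connect("pages.db")  # placeholder database name
            self.conn.execute(
                "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, body TEXT)")

        def close_spider(self, spider):
            self.conn.commit()
            self.conn.close()

        def process_item(self, item, spider):
            self.conn.execute("INSERT OR REPLACE INTO pages VALUES (?, ?)",
                              (item["url"], item["body"]))
            return item

    class PageSpider(scrapy.Spider):
        name = "pages"
        custom_settings = {
            # "__main__..." resolves because the script is run directly.
            "ITEM_PIPELINES": {"__main__.SQLitePipeline": 100},
            "CONCURRENT_REQUESTS": 32,               # arbitrary concurrency limit
        }

        def start_requests(self):
            with open("urls.txt") as f:              # assumes one URL per line
                for line in f:
                    if line.strip():
                        yield scrapy.Request(line.strip(), callback=self.parse)

        def parse(self, response):
            # No scraping here - just hand the raw page to the pipeline.
            yield {"url": response.url, "body": response.text}

    if __name__ == "__main__":
        process = CrawlerProcess()
        process.crawl(PageSpider)
        process.start()                              # blocks until the crawl finishes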

alecxe
  • I have worked with Scrapy and am aware of the benefits. I was more interested in how to write a program that would use Scrapy as a library instead of being bound by the framework's project structure. – Anuvrat Parashar Jun 07 '13 at 14:55
  • Sure, but I wanted to point that out anyway. You don't have to create that project structure to define and run your spiders. E.g. http://stackoverflow.com/questions/13437402/how-to-run-scrapy-from-within-a-python-script. – alecxe Jun 07 '13 at 17:54
0

AFAIK, this is not possible with Scrapy if the URL list does not fit in memory.

This should be possible to do with minet:

minet fetch url_column urls.csv > report.csv
damio