5

I have a list of more than 100,000 URLs (across different domains) that I want to download and save to a database for further processing and tinkering.

Would it be wise to use Scrapy instead of Python's multiprocessing / multithreading? If so, how do I write a standalone script that does this?

Also, feel free to suggest other awesome approaches that come to mind.

Anuvrat Parashar

4 Answers

2

Scrapy does not seem relevant here, since you already know exactly which URLs to fetch (there is no crawling involved).

The easiest way that comes to mind would be to use Requests. However, querying each URL in sequence and blocking while waiting for each answer wouldn't be efficient, so you could consider GRequests to send batches of requests asynchronously.
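A minimal sketch of that idea, assuming the URLs sit in a urls.txt file (one per line) and the pages go into a local SQLite database - the file and table names are just placeholders:

    import grequests  # import first so gevent's monkey patching happens early
    import sqlite3

    # Assumed input: one URL per line in urls.txt (placeholder name).
    urls = [line.strip() for line in open("urls.txt") if line.strip()]

    conn = sqlite3.connect("pages.db")  # placeholder database name
    conn.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, body BLOB)")

    # Build unsent requests lazily, then send them with a bounded pool size
    # so we don't open 100,000 connections at once.
    pending = (grequests.get(u, timeout=10) for u in urls)
    for response in grequests.imap(pending, size=20):
        if response is not None and response.ok:
            conn.execute("INSERT OR REPLACE INTO pages VALUES (?, ?)",
                         (response.url, response.content))
            conn.commit()

    conn.close()

GRequests is built on gevent, so the size argument bounds how many requests are in flight at any moment.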

icecrime
  • It's one of those things you can't imagine living without once you've come across it. Thanks a ton for introducing me to GRequests. – Anuvrat Parashar Jun 06 '13 at 09:20
0

Most site owners will try to block your crawler if you suddenly create a high load.

So even if you have a fixed list of links, you still need to handle timeouts, HTTP status codes, proxies, etc., which Scrapy or Grab can do for you.
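As a rough illustration (my values are placeholders, not a recommendation), these are the kinds of Scrapy settings that cover those concerns; drop them into settings.py or pass them to a CrawlerProcess:

    # Illustrative Scrapy settings for timeouts, retries on bad HTTP status
    # codes, throttling and (optionally) proxies. Values are placeholders.
    SCRAPY_SETTINGS = {
        "DOWNLOAD_TIMEOUT": 15,                 # per-request timeout, in seconds
        "RETRY_ENABLED": True,
        "RETRY_TIMES": 2,                       # retry each failed request up to twice
        "RETRY_HTTP_CODES": [429, 500, 502, 503, 504],
        "AUTOTHROTTLE_ENABLED": True,           # back off automatically under load
        "CONCURRENT_REQUESTS_PER_DOMAIN": 4,    # be polite to each domain
        # "HTTPPROXY_ENABLED": True,            # keep the proxy middleware on if you route through proxies
    }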

b1_
0

Scrapy is still an option.

  1. Speed/performance/efficiency

    Scrapy is written with Twisted, a popular event-driven networking framework for Python. Thus, it is implemented using non-blocking (aka asynchronous) code for concurrency.

  2. Database pipelining

    You mentioned that you want your data to end up in a database - as you may know, Scrapy has an Item Pipeline feature:

    After an item has been scraped by a spider, it is sent to the Item Pipeline which processes it through several components that are executed sequentially.

    So, each page can be written to the database immediately after it has been downloaded (see the sketch at the end of this answer).

  3. Code organization

    Scrapy offers you a nice and clear project structure, where settings, spiders, items, pipelines, etc. are separated logically. That alone makes your code clearer and easier to understand and maintain.

  4. Time to code

    Scrapy does a lot of work for you behind the scenes. This lets you focus on the actual code and logic, rather than on the "metal" part: creating processes, threads, etc.

But, at the same time, Scrapy might be overkill. Remember that Scrapy was designed for (and is great at) crawling and scraping data out of web pages. If you just want to download a bunch of pages without looking into them, then yes, grequests is a good alternative.
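For completeness, here is a rough sketch of what a standalone script could look like with a recent Scrapy version: a spider fed from a fixed URL list, a minimal SQLite item pipeline, and CrawlerProcess so no project scaffolding is needed. The file names, table name and concurrency value are arbitrary placeholders.

    import sqlite3

    import scrapy
    from scrapy.crawler import CrawlerProcess

    class SQLitePipeline:
        """Minimal item pipeline: write each downloaded page straight to SQLite."""

        def open_spider(self, spider):
            self.conn = sqlite3.connect("pages.db")  # placeholder database name
            self.conn.execute(
                "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, body TEXT)")

        def close_spider(self, spider):
            self.conn.commit()
            self.conn.close()

        def process_item(self, item, spider):
            self.conn.execute("INSERT OR REPLACE INTO pages VALUES (?, ?)",
                              (item["url"], item["body"]))
            return item

    class PageSpider(scrapy.Spider):
        name = "pages"
        custom_settings = {
            # "__main__..." resolves because the script is run directly.
            "ITEM_PIPELINES": {"__main__.SQLitePipeline": 100},
            "CONCURRENT_REQUESTS": 32,               # arbitrary concurrency limit
        }

        def start_requests(self):
            with open("urls.txt") as f:              # assumes one URL per line
                for line in f:
                    if line.strip():
                        yield scrapy.Request(line.strip(), callback=self.parse)

        def parse(self, response):
            # No scraping here - just hand the raw page to the pipeline.
            yield {"url": response.url, "body": response.text}

    if __name__ == "__main__":
        process = CrawlerProcess()
        process.crawl(PageSpider)
        process.start()                              # blocks until the crawl finishes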

alecxe
  • I have worked with Scrapy and am aware of the benefits. I was more interested in how to write a program that would use Scrapy as a library instead of being bound by the framework's project structure. – Anuvrat Parashar Jun 07 '13 at 14:55
  • Sure, but I wanted to point that out anyway. You don't have to create that project structure to define and run your spiders. E.g. http://stackoverflow.com/questions/13437402/how-to-run-scrapy-from-within-a-python-script. – alecxe Jun 07 '13 at 17:54
0

AFAIK, this is not possible with Scrapy if the URL list does not fit in memory.

This should be possible to do with minet:

minet fetch url_column urls.csv > report.csv
damio