
I hope you can help me. I am trying to crawl a website that has about 4500 links containing the information I need. The structure is like this:

Tier 1 (just different categories)
Tier 2 (containing different topics)
Tier 3 (containing topic information)

My script opens each category in a loop, then opens topic after topic and extracts all the information from Tier 3. But since there are about 4500 topics, I sometimes get a timeout error and then have to start again from the beginning (sometimes after 200 topics, another time after 2200). My question is: how can I do this the right way, so that if the script crashes I can continue with the topic where it stopped instead of starting over? I am new to Ruby and to crawling and would appreciate any advice.
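Roughly, my script looks like this (a simplified sketch; the URL and the selectors are placeholders, not the real site):

```ruby
require 'nokogiri'
require 'open-uri'

BASE = 'http://example.com'  # placeholder

# Tier 1: the category pages
categories = Nokogiri::HTML(URI.open("#{BASE}/categories")).css('a.category')

categories.each do |category|
  # Tier 2: the topics inside one category
  topics = Nokogiri::HTML(URI.open(BASE + category['href'])).css('a.topic')

  topics.each do |topic|
    # Tier 3: the page with the actual information
    page = Nokogiri::HTML(URI.open(BASE + topic['href']))
    # ... extract the information here ...
  end
end
```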

Thanks!


2 Answers


This sort of question pops up periodically on Stack Overflow. There are a number of things to take into account when writing a single-page scraper, or a whole-site spider.

See "DRY search every page of a site with nokogiri" and "What are some good Ruby-based web crawlers?" and "What are the key considerations when creating a web crawler?" for more information. Those cover a good number of things I do when I'm writing spiders.

the Tin Man

You should definitely split up your parser routine and save the intermediate data into a DB as you go.

My approach would be:

  1. Crawl Tier 1 to gather the categories. Save them into a temporary DB.
  2. Using the DB, crawl Tier 2 to gather the list of topics. Save them into the DB.
  3. Using the DB, crawl Tier 3 to fetch the actual contents. Save them into the DB, and skip or retry a topic if an error occurs (a rough sketch of this step follows the list).
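For example, step 3 could look roughly like this with SQLite as the store (a minimal sketch; the `sqlite3` gem, the table layout, the error classes, and the selector are assumptions, not tied to your site):

```ruby
require 'sqlite3'
require 'nokogiri'
require 'open-uri'
require 'net/http'

# Assumes steps 1 and 2 have already filled a `topics` table with the Tier 3 URLs.
db = SQLite3::Database.new('crawl.db')
db.execute <<~SQL
  CREATE TABLE IF NOT EXISTS topics (
    url     TEXT PRIMARY KEY,
    content TEXT,
    done    INTEGER DEFAULT 0
  )
SQL

# Only fetch topics that are not marked as done, so a restart resumes where
# the previous run stopped instead of starting from the beginning.
db.execute('SELECT url FROM topics WHERE done = 0').each do |(url)|
  begin
    page = Nokogiri::HTML(URI.open(url))
    info = page.at_css('#topic-info')&.text   # placeholder selector
    db.execute('UPDATE topics SET content = ?, done = 1 WHERE url = ?', [info, url])
  rescue OpenURI::HTTPError, Net::OpenTimeout, Net::ReadTimeout => e
    warn "Failed #{url}: #{e.message} -- it stays at done = 0 and is retried on the next run"
  end
end
```

Because the progress lives in a file-backed database, a timeout or crash only costs you the topic that was in flight; rerunning the script picks up the remaining rows.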
saki7
  • The "temporary DB" could be an actual relational database, or you could just use Ruby's `Hash` / `Array`. – saki7 Jun 12 '13 at 10:33
  • Is the temporary DB stored somewhere, or is it filled again on every run? For example, if I have done the first two steps and then step 3 fails: after a restart, should steps 1 and 2 be run again, or should only step 3 be restarted? If so, do I have to save where it failed? – user2448801 Jun 12 '13 at 12:56
  • That depends entirely on your design choices. If you want to recover from errors after an application restart, you must keep the data in permanent storage; otherwise you could keep it in memory. Personally, I would store it in a simple JSON file; ActiveSupport provides a `to_json` method for hashes and arrays. Further questions about the application design would be too broad; please ask a more specific question in a new thread. – saki7 Jun 12 '13 at 13:24
  • Please do not delete this thread even if you have further questions; it might still be helpful to other users. You can click the check mark on my answer to mark it accepted. – saki7 Jun 12 '13 at 13:27
  • Don't save the data in memory. It will disappear when the app stops, whether because of a crash or a deliberate break, all the information will be lost, and the scan will have to start over from the beginning. At a minimum, use SQLite writing to a disk-based DB. Don't use JSON, as it isn't made for random access, which is very important here. JSON is a data-serialization format, not a data store; using it as one is a really bad suggestion on many levels. – the Tin Man Jun 13 '13 at 20:14
  • The advantage of a JSON file is that **it is a file**. That means it can be transferred for caching, put in a Git repository, and shared among several job servers. The temporary data you get while crawling a site is usually not worth storing in an actual DB; caching is enough, and a DB is overkill. – saki7 Jun 14 '13 at 09:36
  • I completely disagree that the cached data is temporary and not useful. A table might contain cache or state information, and throwing it away means starting from the beginning; when a job runs for days or weeks, that isn't an option. Storing multiple tables as JSON and trying to share them among job servers is much more difficult: the second one file changes, everything is out of sync. Having the servers access a single shared database is core to data processing and sits behind huge systems everywhere. – the Tin Man Jun 16 '13 at 01:40
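For comparison, the JSON-file checkpoint idea from the comments could look roughly like this (a minimal sketch; the file name and the layout of the state hash are assumptions, and the standard `json` library is used here instead of ActiveSupport):

```ruby
require 'json'

STATE_FILE = 'crawl_state.json'   # hypothetical checkpoint file

# Placeholder list of Tier 3 URLs; in reality this would come from the Tier 2 crawl.
topic_urls = ['http://example.com/topic/1', 'http://example.com/topic/2']

# Load previously saved progress, or start fresh.
state = if File.exist?(STATE_FILE)
          JSON.parse(File.read(STATE_FILE))
        else
          { 'done' => [], 'pending' => topic_urls }
        end

until state['pending'].empty?
  url = state['pending'].shift
  begin
    # ... fetch and extract the topic page here ...
    state['done'] << url
  rescue StandardError => e
    warn "#{url}: #{e.message} -- putting it back in the queue"
    state['pending'] << url
  end
  File.write(STATE_FILE, JSON.pretty_generate(state))  # checkpoint after every topic
end
```

Whether a plain checkpoint file or a real database is the better fit depends on how long the job runs and how much the state needs to be shared.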