
Guys,

I have the following code to add visited links in my crawler. After extracting links, I have a for loop which loops through each individual href tag.

After I have visited a link and opened it, I add the URL to a visited-links collection variable, defined as:

private final Collection<String> urlFrontier = Collections.synchronizedSet(new HashSet<String>());

The crawler implementation is multithreaded. Suppose I have visited 100,000 URLs; if I don't terminate the crawler, the set will keep growing day by day. Won't that create memory issues? What options do I have to refresh the variable without creating inconsistency across threads?
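For clarity, here is a simplified sketch of that loop (extractLinks stands in for my actual link-extraction code):

```java
import java.util.*;

public class Crawler {
    // the visited-link collection from above
    private final Collection<String> urlFrontier =
            Collections.synchronizedSet(new HashSet<String>());

    public void visit(String url) {
        // add() returns false when the URL is already in the set, so the
        // membership test and the insert are a single atomic operation
        if (!urlFrontier.add(url)) {
            return; // already visited
        }
        // ... open the page, then process every extracted href ...
        for (String href : extractLinks(url)) {
            visit(href);
        }
    }

    // stand-in for the real link extraction
    private List<String> extractLinks(String url) {
        return Collections.emptyList();
    }
}
```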

Thanks in advance!

Develop4Life

3 Answers

The most practical approach for modern crawling systems is to keep the set of visited URLs in a NoSQL database.

This is notably slower than a HashSet, which is why you can layer a caching strategy on top, such as Redis, or even a Bloom filter; a sketch follows.
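For illustration, here is a sketch of the Bloom-filter-in-front-of-the-store idea using Guava's BloomFilter (the sizing values and the UrlStore interface are placeholders for your actual backend):

```java
import java.nio.charset.StandardCharsets;
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

// Placeholder for whatever NoSQL backend actually stores the visited set.
interface UrlStore {
    boolean contains(String url);
    void add(String url);
}

class VisitedUrls {
    private final UrlStore store;

    // Sizing is illustrative: tune expected insertions and the
    // false-positive rate to your crawl volume.
    private final BloomFilter<String> filter = BloomFilter.create(
            Funnels.stringFunnel(StandardCharsets.UTF_8),
            10_000_000,  // expected insertions
            0.01);       // ~1% false-positive probability

    VisitedUrls(UrlStore store) {
        this.store = store;
    }

    boolean isVisited(String url) {
        // A negative answer from the filter is always correct,
        // so most lookups never touch the database at all.
        if (!filter.mightContain(url)) {
            return false;
        }
        // Possible false positive: confirm against the backing store.
        return store.contains(url);
    }

    void markVisited(String url) {
        filter.put(url);
        store.add(url);
    }
}
```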

Given the specific nature of URLs, though, I'd recommend a trie data structure, which gives you many options to manipulate and search by URL string. (A discussion of Java implementations can be found in this Stack Overflow topic.)
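A bare-bones sketch of such a trie (a production version would compress chains of single children and restrict the alphabet):

```java
import java.util.HashMap;
import java.util.Map;

class UrlTrie {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean terminal; // true if a URL ends at this node
    }

    private final Node root = new Node();

    /** Inserts the URL; returns false if it was already present. */
    synchronized boolean add(String url) {
        Node node = root;
        for (int i = 0; i < url.length(); i++) {
            node = node.children.computeIfAbsent(url.charAt(i), c -> new Node());
        }
        if (node.terminal) {
            return false;
        }
        node.terminal = true;
        return true;
    }

    /** True if some stored URL starts with this prefix, e.g. a whole host. */
    synchronized boolean containsPrefix(String prefix) {
        Node node = root;
        for (int i = 0; i < prefix.length(); i++) {
            node = node.children.get(prefix.charAt(i));
            if (node == null) {
                return false;
            }
        }
        return true;
    }
}
```

Because crawled URLs share long prefixes (scheme, host, leading path segments), each shared prefix is stored only once, which is where the memory argument in the comments below comes from.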

Dewfy
  • Thanks Dewfy! I wonder when the variable will be cleared. If I run for, say, 10,000 years, how much memory will I need? How do I solve this, even with the trie structure you suggested? – Develop4Life Nov 18 '15 at 12:53
  • @danielad According to open statistics from Google, the average URL length is 90 characters, and Google today reports about 50 billion (5*10^10) web pages. Some trie implementations claim O(N) memory efficiency for this structure. A simple multiplication gives `90*5*10^10 = 4.5*10^12` bytes ≈ 4191 GB, not such a large number for a modern computer. – Dewfy Nov 18 '15 at 14:28

If your crawlers are any good, managing the crawl frontier quickly becomes difficult, slow and error-prone.

Luckily, you don't need to write this yourself: just write your crawlers to consume the URL Frontier API and plug in an implementation that suits you.

See https://github.com/crawler-commons/url-frontier
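The contract itself is a gRPC service defined in that project's urlfrontier.proto. Conceptually, your crawler then only depends on something like the interface below (a hand-written illustration, not the real generated stubs):

```java
import java.util.List;

// Illustration only: the real contract is the gRPC service generated from
// crawler-commons' urlfrontier.proto; these names are hypothetical.
interface CrawlFrontier {

    // Report URLs discovered while parsing; the frontier deduplicates,
    // schedules, and enforces politeness centrally.
    void putUrls(List<String> discovered);

    // Ask the frontier for the next batch of URLs this crawler should fetch.
    List<String> getNextUrls(int maxUrls);
}
```

Because the crawler only ever talks to the frontier, swapping the implementation (in-memory, persistent, distributed) becomes a configuration change rather than a rewrite.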

matt burns
  1. As per the question, I would recommend using Redis to replace the Collection. It is an in-memory data-structure store that is very fast at inserting and retrieving data, with support for all the standard data structures. In your case that is a Set, and you can check the existence of a key in the set with the SISMEMBER command (see the sketch after this list).

  2. Apache Nutch is also worth exploring.
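A minimal sketch with the Jedis client, assuming a local Redis instance (host, port, and key name are placeholders):

```java
import redis.clients.jedis.Jedis;

public class RedisVisitedSet {
    private static final String KEY = "visited-urls"; // placeholder key name

    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) { // placeholder host/port
            // SADD returns the number of members actually added:
            // 1 means the URL was new, 0 means it was already in the set.
            long added = jedis.sadd(KEY, "http://example.com/");
            System.out.println(added == 1 ? "new URL" : "already visited");

            // SISMEMBER checks membership without modifying the set.
            boolean seen = jedis.sismember(KEY, "http://example.com/");
            System.out.println("visited: " + seen);
        }
    }
}
```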

Anupam