1

Hello Guys! I am going to write website crawler, which starts from root address then crawls every found link (only internal links). So I face this problem: Crawler must start from root, then it should parse web page (root page) and then get all links. While getting links, it should not crawl the same page twice. Guys is there any good data structure or do I need to use SQL or someother indexing data structures?

user873286
  • 7,799
  • 7
  • 30
  • 38

2 Answers2

4

The data structure you are probably looking for is the Tree.

In the case of a crawler though, it would not be needed because starting from the root, you can maintain a "list" of visited URLs and every time you are about to follow a link, check if it has been encountered before. If it has not been encountered then add it to the list and follow it.

It doesn't have to literally be a list (i.e. array) though, it can be a dictionary or other data structure that would help in speeding up the search.

It could also be an SQL database or something like a key-value storage like redis. If you use something like this then all the indexing and querying will be handed for you by the database system with which you could communicate through a standard method (SQL, special API, other).

However, that's the easy part, there are a few more important things to take into account in "crawling" and perhaps it would be best to first check if what you are trying to do can be done with an already available crawler.

A_A
  • 2,326
  • 18
  • 27
2

I'd recommend checking out my answer here: Designing a web crawler and How can I bring google-like recrawling in my application(web or console)

I've answered a lot of the questions you're asking. The key thing to walk away with is that crawlers use a URL-Seen Test in order to efficiently determine if they should crawl a link. The URL-Seen Test is usually implemented using a map structure which quickly resolves keys (urls). Commonly used solutions are embedded databases such as leveldb, berkeleydb, and other NOSQL solutions.

Community
  • 1
  • 1
Kiril
  • 39,672
  • 31
  • 167
  • 226