
Possible Duplicate:
Get random site names in bash

I'm writing a program for university that has to count how often words occur on the web. I need an algorithm that finds sites, counts the words used on them, records the counts, and sorts the words by how many times they are used. So the more sites my program checks, the better. At first I thought of generating random IPs, but that takes far too long (I left the computer searching all night and it found only 15 sites). I guess this is because website IPs aren't evenly distributed across the address space and most IPs belong to users or other services. Now I have a new approach in mind and I wanted to know what you think:

What if I make random searches on Google using some sort of dictionary? The dictionary would start empty, and each time I perform a search I check one site and add to the dictionary only the words that occur once on it, so that later searches won't send me back to that site and corrupt the occurrence counts.

Is this easy?

The first thing I want to do is to also pick random pages of the Google search results, not only the first one. How can this be done? I can't figure out how to calculate the maximum number of result pages for a search, or how to go directly to a specific page.

thanks

Epilogue
  • Could you clarify what you mean by: ' I check one site and add to the dictionary only the words that occur once, so that this won't send me to that site again, by corrupting the occurrences'. I do not understand how this can prevent you from visiting a website twice. – WaelJ Aug 04 '12 at 15:11

1 Answer


While I don't think you could (or should) do this in bash alone, take a look at the Google Custom Search API and this question. It lets you query Google search programmatically.
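To address the "go directly to a specific page" part, here is a minimal Python sketch of how a request URL for a given results page might be built. It assumes the Custom Search JSON API endpoint and its `key`, `cx`, `q`, and `start` query parameters. The 10-results-per-page and 100-results-per-query limits are assumptions to verify against the current documentation, and `YOUR_API_KEY` and `YOUR_ENGINE_ID` are placeholders.

```python
import random
from urllib.parse import urlencode

ENDPOINT = "https://www.googleapis.com/customsearch/v1"
RESULTS_PER_PAGE = 10   # assumption: results returned per request
MAX_RESULTS = 100       # assumption: results reachable per query
MAX_PAGES = MAX_RESULTS // RESULTS_PER_PAGE


def build_search_url(query, page, api_key, engine_id):
    """Return the request URL for a 1-based results page of `query`."""
    if not 1 <= page <= MAX_PAGES:
        raise ValueError(f"page must be between 1 and {MAX_PAGES}")
    # `start` is the 1-based index of the first result on the page.
    start = (page - 1) * RESULTS_PER_PAGE + 1
    params = {"key": api_key, "cx": engine_id, "q": query, "start": start}
    return ENDPOINT + "?" + urlencode(params)


def random_page_url(query, api_key, engine_id, rng=random):
    """Pick one of the reachable result pages at random."""
    page = rng.randint(1, MAX_PAGES)
    return build_search_url(query, page, api_key, engine_id)


if __name__ == "__main__":
    # Placeholder credentials; this only prints the URL, it sends no request.
    print(random_page_url("cat", "YOUR_API_KEY", "YOUR_ENGINE_ID"))
```

Under these assumptions there is no need to compute the number of result pages for each search: the page index is simply drawn from a fixed range, and a page with no results can be skipped.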

As for what queries to use, you could pick words at random from a dictionary file, though that would not match real-world usage, since words like 'cat' are far more common than 'epichorial', say. If you need something that takes those differences into account, you can use a word frequency dictionary, although building one seems to be the point of your research, so perhaps that would not be appropriate.
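The two ways of picking query words can be sketched in Python as follows. This is only an illustration: the word-list path `/usr/share/dict/words` and the frequency table passed to `pick_weighted` are assumptions, not something the question specifies.

```python
import random


def load_words(path):
    """Read one word per line, skipping blank lines."""
    with open(path, encoding="utf-8") as fh:
        return [line.strip() for line in fh if line.strip()]


def pick_uniform(words, k, rng=random):
    """Pick k distinct words; every word is equally likely."""
    return rng.sample(words, k)


def pick_weighted(freqs, k, rng=random):
    """Pick k words (repeats allowed) in proportion to their frequency.

    `freqs` maps word -> count, e.g. taken from a word frequency list.
    """
    words = list(freqs)
    weights = [freqs[w] for w in words]
    return rng.choices(words, weights=weights, k=k)


if __name__ == "__main__":
    # Assumed location of a system word list; adjust for your machine.
    words = load_words("/usr/share/dict/words")
    print(pick_uniform(words, 3))
    # Hypothetical counts, for illustration only.
    print(pick_weighted({"cat": 1000, "epichorial": 1}, 3))
```

Uniform picking treats rare and common words alike, which tends to surface more obscure pages; weighted picking follows common usage but relies on a frequency table you may not want to presuppose.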

WaelJ