
I'm making a script that calculates the distribution of words on the web. What I have to do is check as many random web sites as I can, count the words on those sites, list them, and order them so that the word that occurs the most times is at the top of the list. What I'm doing is generating random IP addresses:

# first octet in 1-255 (avoid 0), remaining octets in 0-255
a=$(( RANDOM % 255 + 1 ))
b=$(( RANDOM % 256 ))
c=$(( RANDOM % 256 ))
d=$(( RANDOM % 256 ))
ip=$a.$b.$c.$d

After that I check with nmap whether port 80 or 8080 is open on that address, so there is a chance that it's a web site.
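
Something like the following is what I have in mind for that check (a sketch only; it leans on nmap's "greppable" output, and the exact option set is my assumption):

# probe only ports 80 and 8080; -Pn skips the ping probe, -oG - prints greppable output
if nmap -Pn -p 80,8080 --open -oG - "$ip" 2>/dev/null | grep -q '/open/'; then
    echo "$ip may be hosting a web server"
fi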

If I'm sure the IP doesn't belong to a web site, I add the address to a blacklist file so that it doesn't get checked again.
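
A simple way to keep that blacklist in the shell (a sketch; the file name blacklist.txt is just a placeholder I picked):

# inside the loop that generates random IPs: skip addresses already ruled out
grep -qxF "$ip" blacklist.txt 2>/dev/null && continue

# later, once nmap reports neither port 80 nor 8080 open, remember the address
echo "$ip" >> blacklist.txt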

If port 80 or port 8080 is open, I then have to resolve the IP with a reverse lookup and get all the domain names that belong to that IP.

The problem is that when I run any of these commands I get only a single PTR record back, while there can be multiple domains on the same IP:

dig -x ipaddress +short
nslookup ipaddress
host ipaddress
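
For what it's worth, dig does print every PTR record that exists for an address, so you can collect them into an array like this (a sketch; mapfile needs bash 4). The catch is that most IPs publish only a single PTR record, so reverse DNS alone won't reveal every domain hosted there, which is the virtual-hosting issue raised in the answer below:

# collect all PTR records dig returns, stripping the trailing dot
mapfile -t names < <(dig +short -x "$ip" | sed 's/\.$//')
for name in "${names[@]}"; do
    echo "PTR: $name"
done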

I'd prefer this to be solved in bash, but if there is a solution in C, it could help as well.

After that, I copy the web site's page to a file using w3m and count the word occurrences.
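
The counting step could be a single pipeline like this (a sketch; the output file name and the use of $ip rather than a resolved host name are my assumptions):

# dump the rendered page as text, split it into lowercase words, rank by frequency
w3m -dump "http://$ip/" \
  | tr -cs '[:alpha:]' '\n' \
  | tr '[:upper:]' '[:lower:]' \
  | grep -v '^$' \
  | sort | uniq -c | sort -rn > wordcounts.txt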

Another problem here: is there a way to check all the publicly available pages that belong to the site, and not only the index one?

Any help is appreciated

  • Port scanning is generally frowned upon. While this is small-scale and restricted to just port 80, I don't think it's realistic that you would get caught; but I would certainly advise you to at least check what your ISP's Acceptable Use Policy has to say about this. You probably don't want to risk your Internet access for this. Anyway, concur with Andrew Mao's answer; "you're doing it wrong". – tripleee Aug 04 '12 at 06:32
  • A reasonably random set of web pages could be obtained by pulling (say) the sixth Google hit from the search results for each of a set of random dictionary words. That way, you are also somewhat less likely to end up on sites not in English, which I presume you want to restrict yourself to. Use multiple English words in each search to reduce the likelihood of false positives. (Still, e.g. "anaconda hat" could be a number of languages other than English. Using longer words only might help, and probably doesn't skew results too much; or include "the" and "of" as search terms in each query?) – tripleee Aug 04 '12 at 06:41
  • ... Hmm, googling for two rare words is going to skew the results heavily towards long documents such as dictionary lists, so don't do that, after all. – tripleee Aug 04 '12 at 06:45
  • possible duplicate of [bash script: word occurrences in web sites](http://stackoverflow.com/questions/11804318/bash-script-word-occurrences-in-web-sites) – Ivan Nevostruev Aug 04 '12 at 19:23
  • I was surprised to find that there's an algorithm for picking an approximately-uniformly random web page. However, it's not ideal for beginners: step 1 is “crawl the web for a while”, and the algorithm takes more time and memory the more uniform you want the distribution to be. http://dpennock.com/papers/rusmevichientong-aaai-fall-2001-uniform.pdf – Jason Orendorff Sep 26 '12 at 16:03

1 Answer


A lot of websites are not accessible purely by IP address, due to virtual hosts and such, so I'm not sure you'd get a uniform distribution of words on the web by doing this. Moreover, the IP addresses that host websites are not evenly distributed over the 32-bit space you're sampling with random numbers: hosting companies with the majority of real websites are concentrated in small ranges, while a lot of other IPs are endpoints of ISPs with probably nothing hosted.

Given the above, and the problem you are trying to solve, I would actually recommend getting a distribution of URLs to crawl and computing the word frequency on those. A good tool for doing that would be something like WWW::Mechanize in Perl (or its Python and Ruby equivalents). As your limiting factor is going to be your internet connection and not your processing speed, there's no advantage to doing this in a low-level language. This way, you'll have a higher chance of hitting multiple sites at the same IP.
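
For example, if you end up with a seed list of URLs (one per line in a hypothetical urls.txt), even a plain shell loop can pool the counts across all of them; this is only a sketch of the aggregation step, not a crawler:

# fetch every URL, pool all the words, and rank them by frequency
while read -r url; do
    w3m -dump "$url"
done < urls.txt |
  tr -cs '[:alpha:]' '\n' | tr '[:upper:]' '[:lower:]' |
  grep -v '^$' | sort | uniq -c | sort -rn | head -50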

Andrew Mao
  • thanks, but unfortunately I need to do the majority of the work in shell programming, as this is homework I have to do for university... To be honest, I know what you mean, but it doesn't really have to be that accurate. Actually I just have to find as many sites as I can and count the words. The problem is that, as you guys know, the majority of the IPs host more than a single domain, and I can't find a way to get all of them, as the commands I listed in my post give only the canonical name of one site :( – Epilogue Aug 04 '12 at 02:57
  • also, sorry, but since the IP addresses for web sites are not evenly distributed, is there a way to check which IP ranges the most important hosting companies use? – Epilogue Aug 04 '12 at 03:02
  • And a random IP approach won't work e.g. on corporate networks with an HTTP proxy. – Basile Starynkevitch Aug 04 '12 at 06:02
  • yes, you guys are right. I was thinking of another approach: what if I make random searches through Google using some sort of dictionary? The dictionary would start out empty, and each time I perform a search I check one site and add to the dictionary only the words that occur once, so that it won't send me to that site again. Is that possible? – Epilogue Aug 04 '12 at 11:15
  • If you are doing shell programming, can you execute things like wget, sed, awk, grep, etc? That might make it easier than trying to do it all in bash. You can fetch webpages and grep them for links. Perhaps use the pagerank method and just follow a Markov process for counting words. – Andrew Mao Aug 06 '12 at 17:19
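
A sketch of the link-following Andrew Mao describes, using only wget, grep, and sed (the regular expression is deliberately crude, only catches absolute href links, and $site is a placeholder for whatever host you are visiting):

# pull the start page and list the absolute links it contains
wget -q -O - "http://$site/" \
  | grep -oE 'href="https?://[^"]+"' \
  | sed -e 's/^href="//' -e 's/"$//' \
  | sort -u > links.txt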