I need to extract the 10 most frequent words from a text using a shell pipeline (plus any Python scripts as needed); the output should be a single line of all-caps words separated by spaces. The pipeline needs to accept text from any external source: I've got it working on .txt files, but I also need to be able to give it a URL and have it do the same thing with that page's text.
I have the following code:
alias words="tr a-z A-Z | tr -cs \"A-Z'\" ' ' | tr ' ' '\012' | sort | uniq -c | sort -rn | head -n 10 | awk '{printf \"%s \", \$2} END {print \"\"}'"
which, with cat hamlet.txt | words
gives me:
TO THE AND A 'TIS THAT OR OF IS
To make it more complicated, I need to exclude any 'function' words: these are 'non-lexical' words like 'a', 'the', 'of', 'is', any pronouns (I, you, him), and any prepositions (there, at, from).
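One way the function-word exclusion could be sketched is with a stop list checked before counting. This is only an illustration, assuming Python 3; the names FUNCTION_WORDS and top_words are hypothetical, and the stop list here is deliberately tiny (a real one would need the full set of pronouns, prepositions, and so on).

```python
import re
import sys
from collections import Counter

# Hypothetical, deliberately short stop list; extend it as needed.
FUNCTION_WORDS = {
    "A", "THE", "OF", "IS", "I", "YOU", "HIM",
    "AT", "FROM", "TO", "AND", "OR", "THAT",
}

def top_words(text, n=10, stop=FUNCTION_WORDS):
    """Return the n most frequent upper-cased words that are not in stop."""
    # Keep letters and apostrophes together, then trim stray quote marks.
    words = (w.strip("'") for w in re.findall(r"[A-Z']+", text.upper()))
    counts = Counter(w for w in words if w and w not in stop)
    return [word for word, _ in counts.most_common(n)]

def main():
    # Read all of standard input and print the result on one line.
    print(" ".join(top_words(sys.stdin.read())))
```

Saved as a script that ends with a call to main(), this could stand in for the whole alias, since it does the upper-casing, counting, and filtering in one step.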
I need to be able to type htmlstrip http://www.google.com.au | words
and have it print out like the above.
For the URL-opening: The python script I'm trying to figure out (let's call it htmlstrip) strips any tags from the text, leaving only 'human readable' text. This should be able to open any given URL, but I can't figure out how to get this to work. What I have so far:
import re
import urllib2
filename = raw_input('File name: ')
filehandle = open(filename)
html = filehandle.read()
f = urllib2.urlopen('http://') #???
print f.read()
text = [ ]
inTag = False
for ch in html:
    if ch == '<':
        inTag = True
    if not inTag:
        text.append(ch)
    if ch == '>':
        inTag = False
print ''.join(text)
I know this is both incomplete and probably incorrect - any guidance would really be appreciated.
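For comparison, here is a minimal sketch of how htmlstrip might be structured, assuming Python 3 (where urllib2 became urllib.request) and the standard-library html.parser module rather than a character-by-character loop. TextExtractor, strip_tags, and main are hypothetical names, and skipping script and style contents is an added assumption about what counts as 'human readable' text.

```python
import sys
from html.parser import HTMLParser
from urllib.request import urlopen

class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring the contents of <script> and <style>."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)

def strip_tags(html):
    """Return the visible text of an HTML string with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Join with spaces so text from adjacent elements does not run together.
    return " ".join(" ".join(parser.parts).split())

def main(argv):
    # Fetch the URL given as the first argument and print its visible text.
    with urlopen(argv[1]) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        html = response.read().decode(charset, errors="replace")
    print(strip_tags(html))
```

Ending the script with main(sys.argv) and making it executable would allow the htmlstrip URL | words form described above; the URL comes from the command line instead of raw_input, so nothing interactive sits in the middle of the pipe.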