
I need to extract the 10 most frequent words from a text using a shell pipeline (plus any additional Python scripts as needed); the output should be a single line of all-caps words separated by spaces. The pipeline needs to work on text from any external source: I've got it working on .txt files, but I also need to be able to feed it a URL and have it do the same thing with that page's text.

I have the following code:

alias words="tr a-z A-Z | tr -cs A-Z '\012' | sort | uniq -c | sort -rn | head -n 10 | awk '{printf \"%s \", \$2} END {print \"\"}'" (all on one line)

which, with cat hamlet.txt | words gives me:

TO THE AND A  'TIS THAT OR OF IS

To make it more complicated, I need to exclude any 'function' words: these are 'non-lexical' words like 'a', 'the', 'of', 'is', any pronouns (I, you, him), and any prepositions (there, at, from).
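One sketch of how that could slot in, given the "additional Python scripts" allowance: a small stop-word filter placed between the tr stages and sort, once the input is one uppercase word per line. The script name (filterwords.py) and the stop list below are placeholders, not a complete set:

#!/usr/bin/env python
# filterwords.py -- drop function words from one-word-per-line input.
import sys

# Placeholder stop list; extend it with the full set of function words,
# pronouns and prepositions you need to exclude.
STOP = set("A THE OF IS TO AND OR THAT I YOU HIM THERE AT FROM".split())

for line in sys.stdin:
    word = line.strip()
    if word and word not in STOP:
        print word

Slotted into the alias, that would look like: tr a-z A-Z | tr -cs A-Z '\012' | python filterwords.py | sort | uniq -c | sort -rn | head -n 10 | awk ...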

I need to be able to type htmlstrip http://www.google.com.au | words and have it print output like the above.

For the URL opening: the Python script I'm trying to figure out (let's call it htmlstrip) strips all tags from the page, leaving only 'human readable' text. It should be able to open any given URL, but I can't figure out how to make that work. What I have so far:

import urllib2

filename = raw_input('File name: ')
filehandle = open(filename)
html = filehandle.read()
filehandle.close()

# This is the part I can't work out -- fetching the page over HTTP
# instead of reading a local file:
# f = urllib2.urlopen('http://')  # ???
# html = f.read()

text = []
inTag = False

# Strip everything between '<' and '>' from the raw HTML.
for ch in html:
    if ch == '<':
        inTag = True
    if not inTag:
        text.append(ch)
    if ch == '>':
        inTag = False

print ''.join(text)

I know this is both incomplete and probably incorrect - any guidance would really be appreciated.

user1374310
  • You should probably be looking at BeautifulSoup for how to download a HTML page and strip it down to human-readable; http://www.crummy.com/software/BeautifulSoup/. This is a FAQ; see also e.g. http://stackoverflow.com/questions/753052/strip-html-from-strings-in-python and http://stackoverflow.com/questions/328356/extracting-text-from-html-file-using-python – tripleee May 20 '12 at 15:02
  • ... or if you are not particularly into using Python for this, `lynx -dump http://page.example.com/ | words` – tripleee May 20 '12 at 15:05
  • :( unfortunately, for this particular task I need to use Python, with no external modules. I'll have a look @ the other posts though, thanks! – user1374310 May 20 '12 at 15:16
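Worth noting, since the constraint is Python with no external modules: the Python 2 standard library already ships an HTMLParser module that can do the tag-stripping. A minimal sketch (it assumes reasonably well-formed HTML; note that handle_data also receives script and style bodies, which you may want to filter out):

from HTMLParser import HTMLParser
import urllib2

class TextExtractor(HTMLParser):
    # Collect the text nodes of the document, ignoring the tags.
    def __init__(self):
        HTMLParser.__init__(self)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

html = urllib2.urlopen('http://www.google.com.au').read()
parser = TextExtractor()
parser.feed(html)
print ' '.join(parser.chunks)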

3 Answers


Use re.sub for this:

import re

# '[^>]+' stops at the first '>', so each tag is matched individually
# rather than everything from the first '<' to the last '>' on a line:
text = re.sub(r"<[^>]+>", " ", html)

For special cases such as scripts, whose contents should be removed too, you can first apply a regex such as:

re.sub(r"(?s)<script.*?>.*?</script>", " ", html)
Joel Cornett
  • won't remove inline css and javascript – Jeff May 20 '12 at 15:26
  • @Jeff: No, it won't. I agree with triplee that the best approach here is to use an actual html parser. Apparently the OP doesn't want to use any "external modules", however. – Joel Cornett May 20 '12 at 15:38
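To fold this into the pipeline from the question while staying stdlib-only, and to deal with the inline script/CSS issue Jeff raises, a sketch along these lines might work (the script name and the exact regexes are illustrative, not definitive):

#!/usr/bin/env python
# htmlstrip.py -- fetch a URL and print its tag-stripped text.
import re
import sys
import urllib2

html = urllib2.urlopen(sys.argv[1]).read()

# Remove script and style blocks wholesale ((?s) lets '.' span newlines),
# then strip the remaining tags individually.
html = re.sub(r"(?s)<(script|style).*?>.*?</\1>", " ", html)
text = re.sub(r"<[^>]+>", " ", html)

print text

which would support the intended htmlstrip http://www.google.com.au | words usage.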

You can use scrape.py and regular expressions like this:

#!/usr/bin/env python
# words.py -- fetch a URL with scrape.py and print its readable text.

from scrape import s
import re
import sys

if len(sys.argv) < 2:
    print "Usage: words.py url"
    sys.exit(1)  # non-zero exit on usage error

s.go(sys.argv[1])                 # fetch the page
text = s.doc.text                 # extract its readable text
text = re.sub(r"\W+", " ", text)  # collapse runs of non-word characters into single spaces
print text

And then just: ./words.py http://whatever.com

ToughLuck

UPDATE: Sorry, I just read the comment about pure Python with no additional modules. In that situation, I think re is indeed the best way to go.

Maybe it would be easier and more robust to use pycURL rather than stripping the tags with re?

from StringIO import StringIO
import pycurl

url = 'http://www.google.com/'

storage = StringIO()  # in-memory buffer for the response body
c = pycurl.Curl()
c.setopt(c.URL, url)
c.setopt(c.WRITEFUNCTION, storage.write)  # have curl write into the buffer
c.perform()
c.close()
content = storage.getvalue()
print content
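For completeness, given the no-external-modules constraint from the question, the equivalent fetch needs only the standard library; a minimal sketch with urllib2:

import urllib2

url = 'http://www.google.com/'
content = urllib2.urlopen(url).read()
print content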
mega.venik