Generalised text extraction from web pages using regex and Python

Question

I need to extract the text from any kind of web page using regex in Python. My code works fine with ordinary HTML tags, but because of the irregular syntax of the tags and attributes enclosed between script tags, it extracts some script data in addition to the useful text. Is there a way to avoid that?

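import re
import urllib.request

def TextExtract(url):
    # Download the page and decode it so the str patterns below can be applied
    page = urllib.request.urlopen(url).read().decode("utf-8", errors="replace")
    print("TEXT: ")
    # Match each <TAG> plus the text that follows it, up to the next "<"
    for m in re.finditer(r"<(?=[^!--]).+?>.*?(?=<)", page, re.DOTALL):
        # Extract the text between ">" and the next "<"
        l = re.search(r"(?<=>).*", m.group(), re.DOTALL)
        # Discard matches that contain an opening style or script tag
        n = re.search(r"(<style.*)|(<script.*)", m.group(), re.DOTALL)
        if n is None:
            print(l.group())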

score 0 · Accepted Answer · edited May 23 '17 at 11:55

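Don't parse HTML with regex. Use the popular Python library lxml.html instead. A real HTML parser understands element boundaries, so script and style elements can be skipped as whole elements rather than guessed at with patterns.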
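To illustrate the parser-based approach, here is a minimal sketch. Note that it swaps in the standard library's html.parser instead of lxml.html, only so that it runs without installing anything; the names VisibleTextExtractor and extract_text are made up for this example. It collects text nodes and ignores everything inside script and style elements.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collect text nodes, skipping anything inside <script> or <style>.

    Hypothetical helper for illustration; uses the stdlib parser, not lxml.
    """

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # greater than 0 while inside a skipped element
        self.chunks = []      # non-empty text chunks, in document order

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased from HTMLParser
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep only non-blank text that is outside script/style elements
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html):
    """Return the visible text chunks of an HTML string as a list."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks
```

The page source fetched with urlopen can be passed to extract_text as a decoded string. With lxml the same idea would probably look like parsing with lxml.html.fromstring, calling drop_tree() on each script and style element, and then reading text_content(); treat that as an outline rather than tested code.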

answered Mar 03 '12 at 01:24

Linus Thiel
