I need to extract the text from arbitrary web pages using regular expressions in Python. My code works fine for ordinary HTML tags, but the content between script tags contains tags and attributes with irregular syntax, so the code I came up with extracts some script data in addition to the useful text. Is there a way to avoid that?

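import re
import urllib.request

def TextExtract():
    # URL is assumed to be defined elsewhere as the address of the page to fetch
    page = urllib.request.urlopen(URL).read().decode("utf-8", "replace")
    print("TEXT: ")
    # each match is <TAG>TEXT, up to (but not including) the next "<"
    for m in re.finditer(r"(?s)<(?=[^!--]).+?>.*?(?=<)", page):
        # TEXT between the first ">" and the end of the match
        text = re.search(r"(?s)(?<=>).*", m.group())
        # discard matches that open a style or script tag
        skip = re.search(r"(?s)(<style.*)|(<script.*)", m.group())
        if skip is None:
            print(text.group())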

1 Answer

Don't parse HTML with regex. Use an HTML parser instead, such as the popular Python library lxml.html.
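
To illustrate the parser-based approach, here is a minimal sketch that collects the text of a page while skipping everything inside script and style elements. Note that it uses the standard library's html.parser module rather than lxml.html, purely so that it runs without extra dependencies. The names VisibleTextParser and extract_text are made up for this example, and the sketch assumes the page has already been downloaded and decoded to a string.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collects text nodes, skipping anything inside <script> or <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # greater than 0 while inside a skipped element
        self.chunks = []      # non-empty text fragments, in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html):
    """Return the text fragments of an HTML string, without script/style content."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return parser.chunks
```

Calling extract_text(page) on the decoded page source returns a list of text fragments. HTML comments are ignored as well, because the parser reports them through a separate callback that this sketch does not override.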

Community
Linus Thiel