
Hey, I'm working on a Python project that requires me to look through a webpage. I want to search it for a specific piece of text: if the text is found, print something out; if not, print an error message. I've already tried different modules such as libxml, but I can't figure out how to do it.

Could anybody lend some help?

AustinM
  • Do you have to search in the entire web page (including HTML tags) or only in the text you can see when you visit the page with a browser? – frm Feb 07 '11 at 20:14

2 Answers


You could do something simple like:


import urllib2
import re

html_content = urllib2.urlopen('http://www.domain.com').read()

matches = re.findall('regex of string to find', html_content)

if len(matches) == 0:
    print 'I did not find anything'
else:
    print 'My string is in the html'
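
If you're on Python 3, urllib2 no longer exists; a rough sketch of the same idea there (the URL and pattern are placeholders, and the page is assumed to be UTF-8):

import re
import urllib.request

# urllib2 was folded into urllib.request in Python 3
html_content = urllib.request.urlopen('http://www.domain.com').read().decode('utf-8')

if re.search('regex of string to find', html_content):
    print('My string is in the html')
else:
    print('I did not find anything')
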
dplouffe
  • Regex is not the right tool when it comes to searching/parsing (x)html: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – snippsat Feb 07 '11 at 21:11
  • If you want to parse the DOM, sure, I agree that regex is not the correct approach. That said, if you want to find a snippet of text in any text blob, I suggest using regular expressions. Whether the text is HTML or not doesn't really matter if you're looking for a specific pattern. – dplouffe Feb 08 '11 at 21:01
  • @dplouffe this post is many years old, would you know if this is still the best option for Python? – Azurespot Jun 23 '20 at 22:52

lxml is awesome: http://lxml.de/parsing.html

I use it regularly with xpath for extracting data from the html.
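
For the question here (just checking whether some text appears on the page), a rough sketch with lxml could look like this; the URL and search string are placeholders, and it assumes the lxml package is installed:

import lxml.html

# Parse the page straight from the URL (placeholder) and grab its visible text
tree = lxml.html.parse('http://www.domain.com')
page_text = tree.getroot().text_content()

if 'text to find' in page_text:
    print('Found the text')
else:
    print('Did not find the text')

# Or, with xpath, pull out specific parts of the document instead
titles = tree.getroot().xpath('//title/text()')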

The other option is http://www.crummy.com/software/BeautifulSoup/ which is great as well.
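
A similar check with BeautifulSoup might look like the sketch below; again the URL and search string are placeholders, and it assumes the newer beautifulsoup4 (bs4) package and Python 3's urllib:

from urllib.request import urlopen
from bs4 import BeautifulSoup

# Fetch the page (placeholder URL) and parse it with the built-in html.parser
soup = BeautifulSoup(urlopen('http://www.domain.com'), 'html.parser')

# get_text() strips the markup, leaving only the visible text
if 'text to find' in soup.get_text():
    print('Found the text')
else:
    print('Did not find the text')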

twasbrillig
Bassdread