Python: Extracting lines from HTML

Question

I am quite new to Python.

I am writing code that uses the urllib2 library to search a certain web page. I am using re.findall to search for specific strings on the page. However, rather than extracting only these strings, I want to extract the entire line that each string occurs on.

For example, let's say I'm searching for the word "hello" on a web page that looks like this:

Hello, my name is Bob. I am Bob.

My friend is Jane.

My name is Jane... hello!

I want to extract the lines that contain "hello" in them. (So that means I would want to extract the first line and the third line.) This is what I've been using below, which is obviously wrong because it only extracts the word, not the entire line the word occurs on:

Page_Content = urllib2.urlopen(My_URL).read()
Matches = re.findall("hello", Page_Content)

How would I modify this code to extract the entire line? Would I have to use a for loop of some sort and search line by line? If so, how would I go about doing that?

for line in Page_Content:
    [code here]

?

please search for "stackoverflow parse html with regular expressions", the long rambling rant. Ok, for the lazy: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 — Tritium21, Sep 22 '13 at 16:46
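The comment above warns against parsing HTML with regular expressions. As a minimal sketch of one alternative, the text can be pulled out of the markup with a parser first and only then searched line by line. This assumes Python 3's standard html.parser module (the question itself uses Python 2), and the names TextExtractor, text_lines_containing, and sample_html are hypothetical, made up for this illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text between tags and ignores the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text found between tags. Note that this
        # also receives the contents of <script> and <style> elements.
        self.chunks.append(data)


def text_lines_containing(html, word):
    """Return the text lines of `html` that contain `word`, ignoring case."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    text = "".join(parser.chunks)
    needle = word.lower()
    return [line.strip() for line in text.splitlines() if needle in line.lower()]


# Hypothetical stand-in for the downloaded page, based on the question's example.
sample_html = (
    "<p>Hello, my name is Bob. I am Bob.</p>\n"
    "<p>My friend is Jane.</p>\n"
    "<p>My name is Jane... hello!</p>\n"
)
found = text_lines_containing(sample_html, "hello")
print(found)
```

Here a "line" means a line of the extracted text, which only matches what a browser shows if the page's source happens to break lines in the same places.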

score 0 · Answer 1 · answered Sep 22 '13 at 16:50

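For the regex issue, you can split the content into lines, iterate over them, and use re.search. Iterating over the string itself would give you single characters, so call splitlines() first. The re.IGNORECASE flag makes "hello" also match "Hello" on the first line of your example.

for line in Page_Content.splitlines():
    if re.search("hello", line, re.IGNORECASE):
        print line

Or better, compile the pattern first:

pat = re.compile("hello", re.IGNORECASE)
for line in Page_Content.splitlines():
    if pat.search(line):
        print line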
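A self-contained sketch of the same line-by-line idea is shown below. It is written so that it also runs on Python 3, and it uses a hard-coded string in place of the downloaded page. The function name lines_containing and the variable sample are hypothetical, and the case-insensitive match is an assumption based on the question expecting the "Hello" line to be found.

```python
import re


def lines_containing(text, word):
    """Return every line of `text` that contains `word`, ignoring case."""
    # re.escape keeps characters such as "." or "?" in `word` literal.
    pattern = re.compile(re.escape(word), re.IGNORECASE)
    # splitlines() yields whole lines; iterating over a str directly
    # would yield single characters instead.
    return [line for line in text.splitlines() if pattern.search(line)]


# Stand-in for Page_Content; the real value would come from urlopen(...).read().
sample = (
    "Hello, my name is Bob. I am Bob.\n"
    "My friend is Jane.\n"
    "My name is Jane... hello!"
)
matches = lines_containing(sample, "hello")
for line in matches:
    print(line)
```

Returning a list instead of printing inside the loop keeps the matching lines available for further processing.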

score 0 · Answer 2 · answered Sep 22 '13 at 18:41

I like Eran's approach, but here's another way that uses regex a bit more heavily and avoids using a for loop:

pattern = re.compile("\n.*hello.*\n")
matching_lines = re.findall(pattern, Page_Content)

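Surrounding the pattern with \n is meant to ensure that we match an entire line. The .* is regex for "zero or more of any character except a newline," so it matches any line with "hello" in it. Note the limits of this pattern: it misses a match on the first line (no preceding \n), on a last line with no trailing \n, and on a line directly after another match (the shared \n has already been consumed). Each result also includes the surrounding \n characters, and the search is case-sensitive, so "Hello" is not matched.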
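A variant of this findall approach that does not depend on a \n on both sides of the line is sketched below. It anchors the pattern with ^ and $ under re.MULTILINE, so the first and last lines are treated like any other, and it adds re.IGNORECASE on the assumption that "Hello" should count as a match. The names LINE_WITH_HELLO and sample are hypothetical, and the string stands in for Page_Content.

```python
import re

# With re.MULTILINE, ^ and $ match at the start and end of every line,
# not only at the start and end of the whole string.
LINE_WITH_HELLO = re.compile(r"^.*hello.*$", re.MULTILINE | re.IGNORECASE)

# Stand-in for Page_Content, taken from the example in the question.
sample = (
    "Hello, my name is Bob. I am Bob.\n"
    "My friend is Jane.\n"
    "My name is Jane... hello!"
)

# "." never matches a newline by default, so each match stays on one line.
matching_lines = LINE_WITH_HELLO.findall(sample)
print(matching_lines)
```

Because $ matches just before the newline, the returned lines do not carry the \n characters with them.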
