I'm trying to use regex to select the values between <title> and </title>. However, sometimes these two tags are on different lines.
As the others have stated, it's more robust and less brittle to use a full-fledged markup parser, like HTMLParser from the stdlib or BeautifulSoup, than regex. But since regex seems to be a requirement, maybe something like this will work:
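import re
import urllib.request

URL = 'http://amazon.com'
page = urllib.request.urlopen(URL)
stream = page.readlines()
flag = False  # True while we are between <title> and </title>
for raw in stream:
    # urlopen yields bytes, so decode each line before matching
    line = raw.decode('utf-8', errors='replace')
    if re.search("<title>", line):
        print(line, end='')
        # closing tag is not on this line: keep printing the lines that follow
        if not re.search("</title>", line):
            flag = True
    elif re.search("</title>", line):
        print(line, end='')
        flag = False
    elif flag:
        print(line, end='')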
When it finds the <title> tag, it prints the line and checks whether the closing tag is on the same line. If it isn't, it keeps printing lines until it finds the closing </title>.
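Since the question is specifically about tags that sit on different lines, another option is to run a single regex over the whole document instead of going line by line. This is only a minimal sketch, assuming the page has already been read into one string; the helper name find_title and the sample markup are made up for illustration:

```python
import re

# re.DOTALL lets "." match newlines, so the pattern can span several lines.
# re.IGNORECASE also accepts <TITLE>. The non-greedy ".*?" stops at the
# first closing tag instead of the last one.
TITLE_RE = re.compile(r"<title\b[^>]*>(.*?)</title>", re.DOTALL | re.IGNORECASE)

def find_title(html):
    """Return the text between the title tags, or None if there is no match."""
    match = TITLE_RE.search(html)
    return match.group(1).strip() if match else None

# Hypothetical input with the two tags on different lines
sample = "<html><head><title>\n  Example\n  page\n</title></head></html>"
print(find_title(sample))
```

This still has the usual regex-on-HTML weaknesses (comments or scripts that contain the text "<title>" will fool it), so treat it as a fallback rather than a replacement for a parser.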
If you can't use a parser, just do it by brute force. Read the HTML doc into the string doc, then:
try:
    title = doc.split('<title>')[1].split('</title>')[0]
except IndexError:
    # no title tag, handle error as you see fit
    title = None
Note that if there is an opening title tag without a matching closing tag, the search still succeeds: title ends up holding everything after the opening tag. Not a likely scenario in a well-formatted HTML doc, but FYI.
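If the missing-closing-tag case matters to you, one way to catch it is str.partition, which reports whether the separator was actually found. A rough sketch along the same brute-force lines; the function name extract_title is just illustrative, and it assumes lowercase tags with no attributes, same as the split version:

```python
def extract_title(doc):
    """Return the text between <title> and </title>, or None if either tag is missing."""
    # partition returns (before, separator, after); the separator is ''
    # when it does not occur in the string
    _, opening, rest = doc.partition('<title>')
    if not opening:
        return None  # no opening tag
    title, closing, _ = rest.partition('</title>')
    if not closing:
        return None  # opening tag without a matching closing tag
    return title

print(extract_title('<head><title>Hello\nworld</title></head>'))
print(extract_title('<head><title>Hello</head>'))
```

The first call returns the title text even though it spans two lines; the second returns None instead of the rest of the document.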