Regex Selection of Strings

Question

I'm trying to use a regex to select the text between <title> and </title>.

However, the two tags are sometimes on different lines.

Check out this post: http://stackoverflow.com/a/1732454/1290264 — bcorso, May 06 '14 at 00:42
@bcorso This is not as egregious as the usual case -- per http://www.whatwg.org/specs/web-apps/current-work/multipage/semantics.html#the-title-element , `<title>` can't have any attributes, nor may it be nested within itself. — zwol, May 06 '14 at 00:58
That said, in Python there is no reason not to use the perfectly good [HTML parser](https://docs.python.org/2/library/htmlparser.html) from the standard library. Everything will be easier. — zwol, May 06 '14 at 01:00
@zack Unfortunately my task does not permit me to import extra modules. Is there no simple way to have a regex search a block of HTML and show me the values it finds between the title tags? — Matt Biggs, May 06 '14 at 01:16
Wait, if you can't `import`, how are you using regular expressions? You need `import re` for those. — Leigh, May 06 '14 at 01:37

score 1 · Accepted Answer · answered May 06 '14 at 01:51

As the others have stated, it's more powerful and less brittle to use a full-fledged markup parser, like HTMLParser from the standard library or even BeautifulSoup, than regex. Still, since regex seems to be a requirement, maybe something like this will work:

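import urllib2  # Python 2 module; Python 3 moved this to urllib.request
import re

URL = 'http://amazon.com'
page = urllib2.urlopen(URL)
stream = page.readlines()
flag = False  # True while we are between <title> and </title>
for line in stream:
    if re.search("<title>", line):
        # Opening tag found: print it and check whether the title continues
        print line
        if not re.search("</title>", line):
            flag = True
    elif re.search("</title>", line):
        # Closing tag on a later line: print it and stop collecting
        print line
        flag = False
    elif flag:
        # A line in the middle of a multi-line title
        print line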
When it finds the <title> tag it prints the line, checks to make sure the closing tag isn't on the same line, and then continues to print lines until it finds the closing </title>.
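If the whole page is available as one string (for example from `page.read()` instead of `page.readlines()`), a single pattern can also span the line break. This is only a minimal sketch, not the approach used in the answer above: the sample `html` string and the `extract_title` helper are made up for illustration, and it assumes a plain `<title>` tag with no attributes.

```python
import re

# Hypothetical sample input; in practice this would be the fetched page source.
html = """<html><head><title>
  Example
  Page
</title></head><body></body></html>"""

# re.DOTALL lets "." match newlines, so the title may span several lines.
# ".*?" is non-greedy, so the match stops at the first closing tag.
# re.IGNORECASE also accepts <TITLE> ... </TITLE>.
TITLE_RE = re.compile(r"<title>(.*?)</title>", re.DOTALL | re.IGNORECASE)

def extract_title(text):
    """Return the title text with whitespace collapsed, or None if not found."""
    match = TITLE_RE.search(text)
    if match is None:
        return None
    # Collapse newlines and runs of spaces into single spaces.
    return " ".join(match.group(1).split())

print(extract_title(html))  # prints: Example Page
```

Only `import re` is needed, which fits the no-extra-modules constraint mentioned in the comments. A real parser remains the more robust choice for arbitrary HTML.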

Chris Johnson · Answer 2 · 2014-05-06T14:38:22.173

1

If you can't use a parser, just do it by brute force. Read the HTML document into a string named doc, then:

try:
    title = doc.split('<title>')[1].split('</title>')[0]
except IndexError:
    # no title tag, handle error as you see fit
    title = None
Note that if there is an opening title tag without a matching closing tag, the lookup still succeeds and returns everything after the opening tag. Not a likely scenario in a well-formed HTML doc, but FYI.
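If the unmatched-tag case matters, one possible variation is to use `str.partition`, which reports whether each separator was actually found. This is a sketch under the same assumptions as the snippet above (a plain, lowercase `<title>` tag); the function name `title_between_tags` is made up for illustration.

```python
def title_between_tags(doc):
    """Return the text between <title> and </title>, or None if either tag is missing."""
    # str.partition returns (before, separator, after);
    # the separator slot is an empty string when it was not found.
    _, opening, rest = doc.partition("<title>")
    if not opening:
        return None  # no opening tag
    title, closing, _ = rest.partition("</title>")
    if not closing:
        return None  # opening tag without a matching closing tag
    return title

print(title_between_tags("<head><title>Hello</title></head>"))  # prints: Hello
```

Because the whole document is one string, a title split across lines is returned with its newlines intact.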

edited May 06 '14 at 14:38

answered May 06 '14 at 02:10
