-1

I'm trying to use a regex to select the text between <title> and </title>.

However, the two tags are sometimes on different lines.

rypel
Matt Biggs
  • Check out this post: http://stackoverflow.com/a/1732454/1290264 – bcorso May 06 '14 at 00:42
  • @bcorso This is not as egregious as the usual case -- per http://www.whatwg.org/specs/web-apps/current-work/multipage/semantics.html#the-title-element , `<title>` can't have any attributes, nor may it be nested within itself. – zwol May 06 '14 at 00:58
  • That said, in Python there is no reason not to use the perfectly good [HTML parser](https://docs.python.org/2/library/htmlparser.html) from the standard library. Everything will be easier. – zwol May 06 '14 at 01:00
  • @zack Unfortunately my task does not permit me to import extra modules. Is there no simple way to have regex search a block of HTML and show me the values it finds between the <title> tags? – Matt Biggs May 06 '14 at 01:16
  • Wait, if you can't `import`, how are you using regular expressions? You need `import re` for those. – Leigh May 06 '14 at 01:37
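
For reference, the standard-library parser suggested in the comments could be used roughly as follows. This is only a sketch, written for Python 3 (where the module is html.parser rather than the Python 2 HTMLParser module linked above); the names TitleParser and title_of are made up for illustration.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text that appears inside the <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lower-cased, so <TITLE> is handled too.
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so accumulate it.
        if self.in_title:
            self.parts.append(data)


def title_of(html):
    """Return the stripped title text, or an empty string if there is none."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip()


sample = "<html><head><title>\n  Example Page\n</title></head></html>"
print(title_of(sample))
```

Because the parser tracks where the element opens and closes, it does not matter whether the two tags share a line.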

2 Answers

1

As the others have stated, a full-fledged markup parser, such as HTMLParser from the standard library or BeautifulSoup, is more powerful and less brittle than regex. But since regex seems to be a requirement, maybe something like this will work:

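import re
import urllib2  # Python 2 standard library

URL = 'http://amazon.com'
page = urllib2.urlopen(URL)
stream = page.readlines()
flag = False  # True while we are between <title> and </title>
for line in stream:
    if re.search("<title>", line):
        # Opening tag found: print the line it is on
        print line
        # Keep printing later lines unless the closing tag is here too
        if not re.search("</title>", line):
            flag = True
    elif re.search("</title>", line):
        # Closing tag found on a later line: print it and stop
        print line
        flag = False
    elif flag:
        # A line between the opening and closing tags
        print line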

When it finds the <title> tag it prints the line, checks to make sure the closing tag isn't on the same line, and then continues to print lines until it finds the closing </title>.
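
If the whole page is held in one string instead of being read line by line, a single pattern can span the line break. The following is a minimal sketch of that idea, written for Python 3 and not taken from the answer above; the helper name extract_title and the sample string are invented for illustration.

```python
import re

# Non-greedy match between the tags. DOTALL lets "." match newlines,
# so the tags may sit on different lines; IGNORECASE accepts <TITLE>.
TITLE_RE = re.compile(r"<title\b[^>]*>(.*?)</title>", re.DOTALL | re.IGNORECASE)


def extract_title(html):
    """Return the title text with runs of whitespace collapsed, or None."""
    match = TITLE_RE.search(html)
    if match is None:
        return None
    return " ".join(match.group(1).split())


sample = "<html><head><title>\n  Example\n  Page\n</title></head></html>"
print(extract_title(sample))  # prints: Example Page
```

Unlike the loop, this returns only the text between the tags rather than the full lines that contain them.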

cmrust
1

If you can't use a parser, just do it by brute force. Read the HTML document into a string named doc, then:

try:
    title = doc.split('<title>')[1].split('</title>')[0]
except IndexError:
    # no title tag; handle the error as you see fit
    title = None

Note that if there is an opening title tag without a matching closing tag, the split still succeeds and title ends up holding the rest of the document after the opening tag. That is not a likely scenario in a well-formed HTML doc, but FYI.
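
One way to guard against the missing closing tag is str.partition, which reports whether the separator was actually found. This is a sketch under the same assumption that the page is already in a string, written for Python 3; the function name title_by_partition is hypothetical.

```python
def title_by_partition(doc):
    """Return the text between <title> and </title>, or None if either tag is missing."""
    # partition returns (before, separator, after); the separator is an
    # empty string when it does not occur in the input.
    _, open_tag, rest = doc.partition("<title>")
    if not open_tag:
        return None
    text, close_tag, _ = rest.partition("</title>")
    if not close_tag:
        return None
    return text.strip()


print(title_by_partition("<head><title>\nMy Page\n</title></head>"))  # prints: My Page
```

Like the split version, this matches the tags literally, so it would miss an upper-case <TITLE>.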

Chris Johnson