1

I am trying to sort through HTML tags and I can't seem to get it right.

What I have done so far:

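import urllib
import re

s = raw_input('Enter URL: ')
f = urllib.urlopen(s)
s = f.read()
f.close()  # parentheses are needed to actually call close()
# Raw string, so that \b means "word boundary" instead of a backspace character
r = re.compile(r'<TAG\b[^>]*>(.*?)</TAG>')
result = re.findall(r, s)
print(result)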

Here I replace "TAG" with the tag I want to see.

Thanks in advance.

Sven Marnach
Krayons
  • Use an XML parser to parse HTML. Mandatory link: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Sven Marnach Jan 31 '11 at 22:05
  • Don't parse HTML with regex. Regex is an insufficiently complex tool to parse HTML. If someone is asking you to do this, beat them over the head with a stick and then use BeautifulSoup instead. It'll be less painful for the both of you. – Chinmay Kanchi Jan 31 '11 at 22:27
  • What sort of results are you currently getting? – Eli Jan 31 '11 at 22:27
  • It is not a good idea to use an XML parser if you are scanning HTML from the web; some HTML pages are very far from being XML-compliant. – VGE Feb 01 '11 at 08:27

3 Answers

5

You should really try using libraries which can perform HTML parsing out of the box. Beautiful Soup is one of my favorites.
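To give a rough idea of what the parser-based approach looks like, here is a minimal sketch in Python 3. It uses the standard library's html.parser module rather than Beautiful Soup, purely so that it runs without installing anything. The class name TagTextCollector and the sample markup are made up for this illustration.

```python
from html.parser import HTMLParser


class TagTextCollector(HTMLParser):
    """Collect the text inside each occurrence of one tag (hypothetical helper)."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag.lower()
        self.depth = 0       # number of matching tags currently open
        self.results = []    # one string per outermost matching element
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                # The outermost matching element just closed
                self.results.append(''.join(self._buffer))
                self._buffer = []

    def handle_data(self, data):
        # Keep text only while we are inside the tag of interest
        if self.depth > 0:
            self._buffer.append(data)


# Made-up sample markup for the demonstration
sample = ('<p id="a">This is paragraph <b>one</b>.</p>'
          '<p>This is paragraph <b>two</b>.</p>')

collector = TagTextCollector('b')
collector.feed(sample)
collector.close()
print(collector.results)
```

Because the parser tracks open and close tags itself, attributes and nesting do not need any special handling in the calling code.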

Miguel
  • BeautifulSoup is _perfect_ for this. – atp Jan 31 '11 at 22:03
  • This is a regex learning experience and it was the only real example I could come up with. For example, given dog.avicat.avipig.jpg, is there any way I could split it into dog.avi cat.avi pig.jpg? – Krayons Jan 31 '11 at 22:09
1

I'm not entirely clear on what you are trying to achieve with the regex. Matching everything from an opening div tag to the next closing one, for instance, works with

re.compile("<div.*?>.*?</div>")

Note that this pattern runs into problems with nested divs: the non-greedy match stops at the first </div> it finds, even if that one closes an inner div.
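To make the nesting problem concrete, here is a small Python 3 sketch. The sample markup is made up, and a capture group plus re.DOTALL have been added to the pattern for the demonstration, so this is not exactly the expression given above.

```python
import re

# Made-up sample: one div nested inside another
html = '<div id="outer">A<div id="inner">B</div>C</div>'

# The same idea as the pattern above, with a capture group around the contents.
# re.DOTALL lets "." match newlines too, which real pages usually contain.
pattern = re.compile(r'<div.*?>(.*?)</div>', re.DOTALL)

matches = pattern.findall(html)
print(matches)
```

The single match starts at the outer opening tag but ends at the inner closing tag, so the trailing "C" is lost and the captured text still contains the inner opening tag.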

Matti Lyra
  • I think you forgot the parentheses around the capture group. Did you mean: re.compile("<div.*?>(.*?)</div>")? – Eli Jan 31 '11 at 22:37
  • He's not using the capture group in his code for anything so I thought it wasn't needed. I'm sure an able programmer can add it should the need for one arise. – Matti Lyra Jan 31 '11 at 23:02
1

Here is an example using Beautiful Soup (BS):

from BeautifulSoup import BeautifulSoup
doc = ['<html><head><title>Page title</title></head>',
       '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
       '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
       '</html>']
soup = BeautifulSoup(''.join(doc))
soup.findAll('b')
# [<b>one</b>, <b>two</b>]

As for a regular expression, you can use

import re

aa = doc[0]
# aa == '<html><head><title>Page title</title></head>'
# The lookbehind and lookahead keep the tags themselves out of the match
pt = re.compile('(?<=<title>).*?(?=</title>)')
re.findall(pt, aa)
# ['Page title']
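One caveat with the lookbehind approach: Python's re module only accepts fixed-width lookbehinds, so it cannot be extended directly to tags that carry attributes. The Python 3 sketch below, using made-up sample markup, shows the failure and a capture-group alternative; it is one possible workaround, not the only one.

```python
import re

# Made-up sample: a tag that carries attributes
html = '<p id="firstpara" align="center">This is paragraph one.</p>'

# A variable-length opening tag inside (?<=...) is rejected at compile time
try:
    re.compile(r'(?<=<p[^>]*>).*?(?=</p>)')
    lookbehind_ok = True
except re.error:
    lookbehind_ok = False

# A capture group has no such restriction
capture = re.compile(r'<p\b[^>]*>(.*?)</p>', re.DOTALL)
texts = capture.findall(html)

print(lookbehind_ok, texts)
```

The capture-group pattern still has the usual limits of regex on HTML, such as nested tags of the same name.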
gerry