Python Regex Help

Question

I am trying to sort through HTML tags and I can't seem to get it right.

What I have done so far

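import urllib
import re

s = raw_input('Enter URL: ')
f = urllib.urlopen(s)
s = f.read()
f.close()  # close() must be called; a bare f.close does nothing
# Raw string, so \b is a word boundary rather than a backspace character
r = re.compile(r'<TAG\b[^>]*>(.*?)</TAG>')
result = re.findall(r, s)
print(result)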

where I replace "TAG" with the tag I want to see.

Thanks in advance.

Use an XML parser to parse HTML. Mandatory link: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 — Sven Marnach, Jan 31 '11 at 22:05
Don't parse HTML with regex. Regex is an insufficiently complex tool to parse HTML. If someone is asking you to do this, beat them over the head with a stick and then use BeautifulSoup instead. It'll be less painful for the both of you. — Chinmay Kanchi, Jan 31 '11 at 22:27
It is not a good idea to use an XML parser if you are scanning HTML from the web; some HTML pages are very far from being XML-compliant. – VGE, Feb 01 '11 at 08:27

score 5 · Answer 1 · answered Jan 31 '11 at 22:00

You should really try using libraries which can perform HTML parsing out of the box. Beautiful Soup is one of my favorites.
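
As a rough illustration of what a parsing library gives you, here is a minimal sketch that uses the standard library's html.parser module (Python 3) rather than Beautiful Soup, so it runs without installing anything. The names TagTextCollector and text_in_tag and the sample markup are made up for this example; treat it as a sketch, not as how Beautiful Soup works internally.

```python
from html.parser import HTMLParser


class TagTextCollector(HTMLParser):
    """Hypothetical helper: collect the text inside each occurrence of one tag."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag.lower()   # HTMLParser reports tag names in lowercase
        self.depth = 0           # number of matching tags currently open
        self.results = []        # one string per outermost matching element
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.results.append("".join(self._buffer))
                self._buffer = []

    def handle_data(self, data):
        # Keep text only while we are inside the tag of interest
        if self.depth > 0:
            self._buffer.append(data)


def text_in_tag(html, tag):
    parser = TagTextCollector(tag)
    parser.feed(html)
    parser.close()
    return parser.results


# Made-up sample markup
sample = "<p>This is paragraph <b>one</b>.</p><p>This is paragraph <b>two</b>.</p>"
bold_text = text_in_tag(sample, "b")
print(bold_text)
```

Because the parser tracks open and close tags itself, nested markup such as the b inside each p does not confuse it, which is the main thing a regex cannot offer.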

Miguel

BeautifulSoup is _perfect_ for this. – atp Jan 31 '11 at 22:03
This is a regex learning experience, and it was the only real example I could come up with. For example, given dog.avicat.avipig.jpg, is there any way I could split it into dog.avi cat.avi pig.jpg? – Krayons Jan 31 '11 at 22:09
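
For the file-name example in the comment above, one possible sketch follows. It assumes the set of extensions is known in advance (here only avi and jpg, taken from the example string); the EXTENSIONS tuple is a name invented for this illustration. Without such a list the split is ambiguous, so this is not a general solution.

```python
import re

# Assumption: the only extensions that can occur are these two
EXTENSIONS = ("avi", "jpg")

# \w+? lazily takes the shortest run of word characters that is
# followed by a dot and one of the known extensions
pattern = re.compile(r"\w+?\.(?:%s)" % "|".join(EXTENSIONS))

names = pattern.findall("dog.avicat.avipig.jpg")
print(names)  # ['dog.avi', 'cat.avi', 'pig.jpg']
```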

score 1 · Answer 2 · answered Jan 31 '11 at 22:22

I'm not entirely clear on what you are trying to achieve with the regex. Capturing the contents between two div tags, for instance, works with:

re.compile("<div.*?>.*?</div>")

However, the pattern above will run into problems with nested divs.
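
The nested-div problem can be seen in a small sketch. The two sample strings are made up for illustration; the pattern is the one given in this answer.

```python
import re

pattern = re.compile("<div.*?>.*?</div>")

# Made-up inputs: two sibling divs, and one div nested inside another
flat = "<div>one</div><div>two</div>"
nested = "<div>outer <div>inner</div> tail</div>"

flat_matches = pattern.findall(flat)
nested_matches = pattern.findall(nested)

print(flat_matches)    # ['<div>one</div>', '<div>two</div>']
print(nested_matches)  # ['<div>outer <div>inner</div>']
```

In the nested case the lazy .*? stops at the first </div> it meets, so the match ends at the inner closing tag and the rest of the outer div (" tail</div>") is lost. A regex of this form has no way to count how many divs are open.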

Matti Lyra

I think you forgot the parentheses around the capture group. Did you mean: re.compile("<div.*?>(.*?)</div>")? – Eli Jan 31 '11 at 22:37
He's not using the capture group in his code for anything so I thought it wasn't needed. I'm sure an able programmer can add it should the need for one arise. – Matti Lyra Jan 31 '11 at 23:02

score 1 · Accepted Answer · answered Jan 31 '11 at 23:38

Here is an example from the BeautifulSoup documentation:

>>> from BeautifulSoup import BeautifulSoup
>>> doc = ['<html><head><title>Page title</title></head>',
...        '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
...        '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
...        '</html>']
>>> soup = BeautifulSoup(''.join(doc))
>>> soup.findAll('b')
[<b>one</b>, <b>two</b>]

As for a regular expression, you can use

>>> import re
>>> aa = doc[0]
>>> aa
'<html><head><title>Page title</title></head>'
>>> pt = re.compile('(?<=<title>).*?(?=</title>)')
>>> re.findall(pt, aa)
['Page title']
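
The lookbehind form works for a bare <title> tag, but a lookbehind in Python's re module must have a fixed width, so it cannot absorb attributes of varying length. A capture group, as in the question's own pattern, avoids that limit. The sketch below wraps that pattern in a helper; the name tag_contents and the sample strings are made up for illustration, and it still shares the nesting limitation of every regex approach discussed here.

```python
import re


def tag_contents(tag, html):
    """Hypothetical helper: text between <tag ...> and </tag>, non-nested only."""
    pattern = re.compile(
        # Raw string: \b is a word boundary; in a plain string it is a backspace
        r"<{0}\b[^>]*>(.*?)</{0}>".format(re.escape(tag)),
        re.IGNORECASE | re.DOTALL,
    )
    return pattern.findall(html)


html = '<html><head><title>Page title</title></head>'
titles = tag_contents("title", html)
print(titles)  # ['Page title']

# Attributes on the opening tag are skipped by [^>]*
paragraphs = tag_contents("p", '<p id="firstpara" align="center">first</p>')
print(paragraphs)  # ['first']

# Why the raw string matters: '\b' is one backspace character, r'\b' is two characters
print(len('\b'), len(r'\b'))  # 1 2
```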
