Parse XML using regular expressions

Question

I want to parse some tags.

The markup looks like this:

<div id="tags">blah-blah<a href="http://url/tag">What_I_Want</a></div>

I thought this would work:

re.findall(">"."</a></div>")

but it didn't.

What's wrong with it?

------------ Update I -------------

Now I know that re is not a good fit for HTML.

Raj gave me an answer:

>>> from bs4 import BeautifulSoup
>>> s = '<div id="tags">blah-blah<a href="http://url/tag">What_I_Want</a></div>'
>>> soup = BeautifulSoup(s, 'html.parser')
>>> soup.select('div > a:first-of-type')[0].text
'What_I_Want'

I also have another question. How can I find

<div id blah blah </div>

in the entire file?

Sigh. Don't try to parse HTML using regex. http://stackoverflow.com/a/1732454/104349 — Daniel Roseman, Apr 17 '15 at 09:38
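
As for what was wrong with the original call: re.findall takes two separate arguments, the pattern and the string to search, and ">"."</a></div>" is not valid Python, since string literals cannot be joined with a bare dot. Below is a minimal sketch of what the call was probably meant to do. It assumes the markup always looks exactly like the sample, and the pattern is my own guess at the intent. It breaks as soon as attributes, whitespace, or nesting change, which is why a parser is the safer choice.

```python
import re

s = '<div id="tags">blah-blah<a href="http://url/tag">What_I_Want</a></div>'

# Capture the text between the end of an opening tag and "</a></div>".
# [^<>]* stops the match from running across other tags.
pattern = r'>([^<>]*)</a></div>'

# findall(pattern, string) returns the captured group of every match.
matches = re.findall(pattern, s)
print(matches)
```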

Avinash Raj · Accepted Answer · 2015-04-17T09:56:22.587

1

It seems you're trying to get the text of the a tag that is an immediate child of the div tag.

>>> from bs4 import BeautifulSoup
>>> s = '<div id="tags">blah-blah<a href="http://url/tag">What_I_Want</a></div>'
>>> soup = BeautifulSoup(s, 'html.parser')
>>> soup.select('div > a:first-of-type')[0].text
'What_I_Want'
>>> soup.select('div > a')[0].text
'What_I_Want'
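
For the follow-up in the question (finding the div in an entire file), here is a minimal sketch. It assumes the target is the div whose id is "tags"; the surrounding page, the other div elements, and their link texts are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical page: several <div> elements, only one with id="tags".
html = """
<html><body>
<div id="other">skip<a href="http://url/other">Not_This</a></div>
<div id="tags">blah-blah<a href="http://url/tag">What_I_Want</a></div>
<div>no id<a href="http://url/none">Nor_This</a></div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# find() searches the whole tree and returns the first match, or None.
tags_div = soup.find("div", id="tags")

# Collect the text of every <a> inside that div (empty list if no such div).
link_texts = [a.get_text() for a in tags_div.find_all("a")] if tags_div else []
print(link_texts)
```

For a real file, the same calls should work if you pass the file's contents (or an open file object) to BeautifulSoup instead of the inline string.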

edited Apr 17 '15 at 09:56

answered Apr 17 '15 at 09:44

Avinash Raj


Thanks a lot! I have a follow-up question: how can I search the entire HTML for it? – E.Laemas Kim Apr 17 '15 at 09:53
@EfirlusKim update your question. – Avinash Raj Apr 17 '15 at 09:54
@EfirlusKim Is this a malformed HTML file? Where is the closing `>` in the opening `div` tag? – Avinash Raj Apr 17 '15 at 10:09
There are so many of them; the only difference is whether id="tags" is present or not. – E.Laemas Kim Apr 17 '15 at 10:12
@EfirlusKim Accept an answer to this question, then ask a new question with the exact input and the output you expect. – Avinash Raj Apr 17 '15 at 10:14

score 0 · Answer 2 · answered Apr 17 '15 at 09:42


Short answer: you can't

Different short answer: Python XML parser (it even has examples)
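
A minimal sketch of the parser route, assuming "Python XML parser" means the standard library's xml.etree.ElementTree (the original link did not survive, so that is a guess). It only works because the sample snippet happens to be well-formed XML; real-world HTML often is not.

```python
import xml.etree.ElementTree as ET

s = '<div id="tags">blah-blah<a href="http://url/tag">What_I_Want</a></div>'

# fromstring() raises ParseError if the input is not well-formed XML.
root = ET.fromstring(s)

# root is the <div>; find("a") returns its first direct <a> child.
link = root.find("a")
print(root.get("id"), link.text)
```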


Davide

