python regex findall

Question

I want to find everything between <span class=""> and </span>

p = re.compile('<span class=\"\">(.*?)\</span>', re.IGNORECASE)
text = re.findall(p, z)

For example, given <span class="">foo</span> I expect it to return foo, but it returns anything! Why does my code go wrong?

Cheers

[Use an XML parser](http://stackoverflow.com/a/1732454/647772) — , Sep 01 '12 at 15:36
don't use regex to parse HTML, use an XML/HTML parser instead — gefei, Sep 01 '12 at 15:37
What do you mean by "it returns anything"? Provide a runnable example with traceback. As it is your code should work as my answer shows. — Mark Tolonen, Sep 01 '12 at 15:42
I think for this simple scenario, you might get away with a simple regex. As Mark shows, your regex should work. It would fail, however, if there were any newlines inside the `<span>` tag. You'd need to compile the regex using `re.I|re.S`. — Tim Pietzcker, Sep 01 '12 at 15:44

score 4 · Accepted Answer · edited May 23 '17 at 11:56

Since HTML is not a regular language, you really should use an XML parser instead.

Python has several to choose from:

ElementTree is part of the standard library
BeautifulSoup is a popular 3rd party library
lxml is a fast and feature-rich C-based library.
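
As a rough illustration of the parser approach, here is a minimal sketch using ElementTree from the standard library. The sample markup, the wrapping `<div>`, and the variable names are assumptions made for the example. ElementTree only accepts well-formed XML, so real-world HTML may need BeautifulSoup or lxml instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup (not from the question).
snippet = '<div><span class="">foo</span><span class="other">bar</span></div>'

root = ET.fromstring(snippet)

# Keep only the spans whose class attribute is exactly the empty string.
texts = [span.text for span in root.iter('span') if span.get('class') == '']
print(texts)  # ['foo']
```

Filtering on `span.get('class') == ''` mirrors the `class=""` condition in the question's regex, without depending on how the tag is spelled or spaced.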

answered Sep 01 '12 at 15:39

Martijn Pieters

"Regular languages" have nothing to do with "regular expressions" (in the CS sense). See http://stackoverflow.com/q/11306641/989121 – georg Sep 01 '12 at 15:42
@thg435: I was more quoting the linked answer than putting much thought into the CS sense of the sentence. Interesting answer, but I find that using a dedicated HTML parser is often the more readable and more maintainable tool for problems like the OP's. :-) – Martijn Pieters Sep 01 '12 at 15:46
Of course he should use a parser. I just wanted to point out that the term "regular language" is not relevant here. The prominent Funny Post is painfully wrong in this regard. – georg Sep 01 '12 at 16:06
@thg435: Noted, I'll avoid the term in the future. – Martijn Pieters Sep 01 '12 at 16:06

Mark Tolonen · Answer 2 · 2012-09-01T16:02:51.757

Your original code works as is. You should use an HTML parser though.

import re
p = re.compile('<span class=\"\">(.*?)\</span>', re.IGNORECASE)
z = '<span class="">foo</span>'
text = re.findall(p, z)
print text

Output:

['foo']

Edit

As Tim points out, re.DOTALL should be used; without it, `.` does not match newlines and the example below would fail:

import re
p = re.compile('<span class="">(.*?)\</span>', re.IGNORECASE|re.DOTALL)
z = '''<span class=""> a more
complicated foo</span>'''
text = re.findall(p, z)
print text
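
To make the effect of the flag visible, here is a small sketch that runs the same pattern with and without `re.DOTALL` on the multi-line input. The variable names are made up for the example, and the pattern is written as a raw string without the backslashes.

```python
import re

z = '<span class=""> a more\ncomplicated foo</span>'
pattern = r'<span class="">(.*?)</span>'

# Without DOTALL, "." stops at the newline, so nothing matches.
without_dotall = re.findall(pattern, z, re.IGNORECASE)

# With DOTALL, "." also matches "\n" and the whole content is captured.
with_dotall = re.findall(pattern, z, re.IGNORECASE | re.DOTALL)

print(without_dotall)  # []
print(with_dotall)     # [' a more\ncomplicated foo']
```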

Even then it would fail for nested spans:

import re
p = re.compile('<span class="">(.*?)\</span>', re.IGNORECASE|re.DOTALL)
z = '''<span class=""> a more
complicated<span class="other">other</span>foo</span>'''
text = re.findall(p, z)
print text

Output (failing):

[' a more\ncomplicated<span class="other">other']

So use an HTML parser like BeautifulSoup:

from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3 import (Python 2)
z = '''<span class=""> a more
complicated<span class="other">other</span>foo</span>'''
# Parse the markup instead of pattern-matching it
soup = BeautifulSoup(z)
print soup.findAll('span',{'class':''})
print
print soup.findAll('span',{'class':'other'})

Output:

[<span class=""> a more
complicated<span class="other">other</span>foo</span>]

[<span class="other">other</span>]
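
If installing a third-party library is not an option, something similar can be sketched with the standard library's `html.parser` module. This assumes Python 3. The `SpanTextCollector` class and its attributes are hypothetical names for this example, not part of any library. It collects the text of each `<span class="">`, including text inside nested spans.

```python
from html.parser import HTMLParser


class SpanTextCollector(HTMLParser):
    """Collect the text of every <span class=""> element, nested spans included."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # how many spans deep we are inside a matching span
        self.buffer = []    # text pieces of the span currently being collected
        self.results = []   # finished texts, one per matching outer span

    def handle_starttag(self, tag, attrs):
        if tag != 'span':
            return
        if self.depth:
            # A span nested inside a span we are already collecting.
            self.depth += 1
        elif dict(attrs).get('class') == '':
            # A new outermost span with class="".
            self.depth = 1
            self.buffer = []

    def handle_endtag(self, tag):
        if tag == 'span' and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.results.append(''.join(self.buffer))

    def handle_data(self, data):
        if self.depth:
            self.buffer.append(data)


z = '''<span class=""> a more
complicated<span class="other">other</span>foo</span>'''

collector = SpanTextCollector()
collector.feed(z)
collector.close()
print(collector.results)  # [' a more\ncomplicatedotherfoo']
```

Tracking the nesting depth is what the non-greedy regex cannot do: the collector only finishes a result when the matching outer `</span>` is reached.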

It would be safer to also specify `re.DOTALL`. Also, all those ugly backslashes can be dropped. — Tim Pietzcker, Sep 01 '12 at 15:45
Sure, and I upvoted your answer for that. But I suspect that the real text contains more than just `foo`, and that that is the actual problem... — Tim Pietzcker, Sep 01 '12 at 15:47
Yeah, well I'm feeling generous today, and my crystal ball has just returned from the cleaner's. — Tim Pietzcker, Sep 01 '12 at 15:48
