I want to find everything between `<span class="">` and `</span>`:

p = re.compile('<span class=\"\">(.*?)\</span>', re.IGNORECASE)
text = re.findall(p, z)

For example, for `<span class="">foo</span>` I expect it to return `foo`, but it returns anything! Why does my code go wrong?

Cheers

user1472850
  • [Use an XML parser](http://stackoverflow.com/a/1732454/647772) – Sep 01 '12 at 15:36
  • don't use regex to parse HTML, use an XML/HTML parser instead – gefei Sep 01 '12 at 15:37
  • What do you mean by "it returns anything"? Provide a runnable example with traceback. As it is, your code should work, as my answer shows. – Mark Tolonen Sep 01 '12 at 15:42
  • Tried your regex and it works just fine. – Ioan Alexandru Cucu Sep 01 '12 at 15:43
  • I think for this simple scenario, you might get away with a simple regex. As Mark shows, your regex should work. It would fail, however, if there were any newlines inside the `<span>` tag. You'd need to compile the regex using `re.I|re.S`. – Tim Pietzcker Sep 01 '12 at 15:44

2 Answers

Since HTML is not a regular language, you really should use an XML parser instead.

Python has several to choose from:
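
A minimal sketch of the parser approach, assuming Python 3 and the standard library's `html.parser` module (just one possible choice). The class name `EmptySpanText` is made up for illustration. It collects the text inside each `<span class="">`, including text from nested spans:

```python
from html.parser import HTMLParser


class EmptySpanText(HTMLParser):
    """Collect the text inside every <span class="">, nested spans included."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # span nesting depth inside a matching span (0 = outside)
        self.parts = []     # text fragments of the span currently being collected
        self.results = []   # one finished string per matching outer span

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        if self.depth:
            # A span nested inside a span we are already collecting.
            self.depth += 1
        elif dict(attrs).get("class") == "":
            # Outermost span whose class attribute is the empty string.
            self.depth = 1
            self.parts = []

    def handle_endtag(self, tag):
        if tag == "span" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.results.append("".join(self.parts))

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


parser = EmptySpanText()
parser.feed('<span class=""> a more\ncomplicated<span class="other">other</span>foo</span>')
parser.close()
print(parser.results)  # [' a more\ncomplicatedotherfoo']
```

Tracking the nesting depth is what lets this handle a span inside a span, which is the case a non-greedy regex cannot get right.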

Community
Martijn Pieters
  • "Regular languages" have nothing to do with "regular expressions" (in the CS sense). See http://stackoverflow.com/q/11306641/989121 – georg Sep 01 '12 at 15:42
  • @thg435: I was more quoting the linked answer than putting much thought into the CS sense of the sentence. Interesting answer, but I find that using a dedicated HTML parser is often the more readable and more maintainable tool for problems like the OP's. :-) – Martijn Pieters Sep 01 '12 at 15:46
  • of course he should use a parser. I just wanted to point out that the term "regular language" is not relevant here. The prominent Funny Post is painfully wrong in this regard. – georg Sep 01 '12 at 16:06
  • @thg435: Noted, I'll avoid the term in the future. – Martijn Pieters Sep 01 '12 at 16:06

Your original code works as is. You should use an HTML parser though.

import re
p = re.compile('<span class=\"\">(.*?)\</span>', re.IGNORECASE)
z = '<span class="">foo</span>'
text = re.findall(p, z)
print text

Output:

['foo']

Edit

As Tim points out, re.DOTALL should be used or the below would fail:

import re
p = re.compile(r'<span class="">(.*?)</span>', re.IGNORECASE|re.DOTALL)
z = '''<span class=""> a more
complicated foo</span>'''
text = re.findall(p, z)
print text
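
To make the difference concrete, here is a small side-by-side sketch (Python 3 syntax assumed; the variable names are just for illustration) of the same pattern with and without `re.DOTALL`:

```python
import re

html = '<span class=""> a more\ncomplicated foo</span>'
pattern = r'<span class="">(.*?)</span>'

# Without DOTALL, "." does not match "\n", so the span body cannot be matched.
without_dotall = re.findall(pattern, html, re.IGNORECASE)

# With DOTALL, "." also matches "\n", so the whole body is captured.
with_dotall = re.findall(pattern, html, re.IGNORECASE | re.DOTALL)

print(without_dotall)  # []
print(with_dotall)     # [' a more\ncomplicated foo']
```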

Even then it would fail for nested spans:

import re
p = re.compile(r'<span class="">(.*?)</span>', re.IGNORECASE|re.DOTALL)
z = '''<span class=""> a more
complicated<span class="other">other</span>foo</span>'''
text = re.findall(p, z)
print text

Output (failing):

[' a more\ncomplicated<span class="other">other']

So use an HTML parser like BeautifulSoup:

# BeautifulSoup 3 import (Python 2)
from BeautifulSoup import BeautifulSoup
z = '''<span class=""> a more
complicated<span class="other">other</span>foo</span>'''
soup = BeautifulSoup(z)
# The outer span, with the nested span kept intact inside it
print soup.findAll('span',{'class':''})
print
# Only the nested span
print soup.findAll('span',{'class':'other'})

Output:

[<span class=""> a more
complicated<span class="other">other</span>foo</span>]

[<span class="other">other</span>]
Mark Tolonen