Why isn't this regexp working?

Question

I have the source code of a webpage, formatted like this:

<span class="l r positive-icon">
Turkish
</span>
<span>
The.Mist[2007]DvDrip[Eng]-aXXo
</span>
<span class="l r neutral-icon">
Vietnamese
</span>
<span>
The.Mist.2007.720p.Bluray.x264.YIFY 
</span>

As you can see, some of the spans have the class "l r positive-icon" or "l r neutral-icon". I want to get only the languages, i.e. the text inside every span that has a class. I am using this regexp, but it gives me an empty list:

allLanguages = re.findall('<span class=".*">\s(.*)\s</span>', allLanguagesTags)

allLanguagesTags contains the source code shown above. Can anybody give me a hint?

Why not use an actual HTML parser to parse this? Trying to extract info from HTML with regular expressions is known to [cause some issues](http://stackoverflow.com/q/1732348). — Martijn Pieters, May 17 '14 at 12:13
@MartijnPieters I am using BeautifulSoup to get all the elements with the class of "a1", but I don't know how to extract the content from within the tags with BeautifulSoup, so I'm using regular expressions. — jvitasek, May 17 '14 at 12:17
@user3647430: why didn't you ask about that in the first place? — Martijn Pieters, May 17 '14 at 12:19
@MartijnPieters I didn't realise I could do that with BeautifulSoup. Thanks for your answer, I'm going to check it out. — jvitasek, May 17 '14 at 12:20

score 3 · Accepted Answer · answered May 17 '14 at 12:16

Don't use regular expressions. Use an actual HTML parser. I recommend you use BeautifulSoup instead:

from bs4 import BeautifulSoup

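# Name the parser explicitly; recent bs4 versions warn if it is omitted.
soup = BeautifulSoup(yourhtml, 'html.parser')
# class_=True matches every <span> that has a class attribute at all.
languages = [s.get_text().strip() for s in soup.find_all('span', class_=True)]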

Demo:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('''\
... <span class="l r positive-icon">
... Turkish
... </span>
... <span>
... The.Mist[2007]DvDrip[Eng]-aXXo
... </span>
... <span class="l r neutral-icon">
... Vietnamese
... </span>
... <span>
... The.Mist.2007.720p.Bluray.x264.YIFY 
... </span>
... ''')
>>> [s.get_text().strip() for s in soup.find_all('span', class_=True)]
[u'Turkish', u'Vietnamese']
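
As for why the original pattern comes back empty: on the snippet exactly as posted it actually does match, so the real page most likely differs in its whitespace. That is a guess, since the question doesn't show the raw text; the sketch below assumes Windows-style `\r\n` line endings (the names `sample_lf`, `sample_crlf` and `tolerant` are mine, not from the question). Each `\s` consumes exactly one whitespace character and `.` never crosses a newline, so one extra whitespace character is enough to break the match:

```python
import re

# The pattern from the question, written as a raw string.
pattern = r'<span class=".*">\s(.*)\s</span>'

# The snippet as posted: a single "\n" on each side of the text.
sample_lf = (
    '<span class="l r positive-icon">\nTurkish\n</span>\n'
    '<span>\nThe.Mist[2007]DvDrip[Eng]-aXXo\n</span>\n'
    '<span class="l r neutral-icon">\nVietnamese\n</span>\n'
)

# Assumption: the real page uses "\r\n" line endings instead.
sample_crlf = sample_lf.replace('\n', '\r\n')

matches_lf = re.findall(pattern, sample_lf)
matches_crlf = re.findall(pattern, sample_crlf)

# A more forgiving variant: \s* allows any amount of whitespace,
# [^"]* cannot run past the closing quote, and .*? stops early.
tolerant = r'<span class="[^"]*">\s*(.*?)\s*</span>'
matches_tolerant = re.findall(tolerant, sample_crlf)

print(matches_lf)        # ['Turkish', 'Vietnamese']
print(matches_crlf)      # []
print(matches_tolerant)  # ['Turkish', 'Vietnamese']
```

The forgiving variant still depends on the exact attribute order and quoting, which is why the parser-based approach above remains the safer choice.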

Love that soup of yours, Martijn. +1 :) — zx81, May 18 '14 at 03:07
