Looking for the right RE expression (python)

Question

I want to write a Python script that looks for:

    <span class="toujours_cacher">(.)*?</span>

I use this RE:

    r"(?i)\<span (\n|\t| )*?class=\"toujours_cacher\"(.|\n)*?\>(.|\n)*?\<\/span\>"

However, in some of my pages I found this kind of markup:

    <span class="toujours_cacher">*
    <span class="exposant" size="1">*</span> *</span>

so I tried this RE:

    r"(?i)\<span (\n|\t| )*?class=\"toujours_cacher\"(.|\n)*?\>(.|\n)*?(\<\/span\>|\<\/span\>(.|\n)*?<\/span>)"

which is not good, because when there is no nested span in between, it keeps going until the next </span> further down the page.

I need to delete the content between the span with the class "toujours_cacher". Is there any way to do it with one RE?

I will be pleased to hear any of your suggestions :)

If you're parsing HTML or XML, please don't try to use regex. Take a look at ETree, BeautifulSoup, or some other parsing library. — Morgan Thrapp, Jul 01 '15 at 14:30
This does not do what you think it does: `(.|\n)*` . You need to learn about character classes, and where to put the parenthesis around capture groups. — le3th4x0rbot, Jul 01 '15 at 14:32
I can only use 'standard' Python for this; it is a script for people who don't know anything about programming and who won't be able to install modules – whitefret, Jul 01 '15 at 14:34
@whitefret Well then teach them to parse a `regular` text file or something using regular expressions. Not a webpage, because you are teaching them to use regex incorrectly — heinst, Jul 01 '15 at 14:42

score 0 · Answer 1 · answered Jul 01 '15 at 14:33

This is (provably) impossible with regular expressions - they cannot match delimiters to arbitrary depth. You'll need to move to using an actual parser instead.
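To illustrate the nesting problem, here is a minimal sketch using only the standard `re` module. The pattern is a simplified stand-in for the one in the question (it uses `re.DOTALL` instead of the `(.|\n)` alternation), and the sample strings `flat` and `nested` are made up for the demonstration. It shows that a non-greedy match stops at the first `</span>` it meets, whether or not that tag closes the outer span.

```python
import re

# Simplified stand-in for the asker's pattern. re.DOTALL lets "." match
# newlines, which replaces the (.|\n) alternation.
pattern = re.compile(
    r'<span\s+class="toujours_cacher"[^>]*>.*?</span>',
    re.IGNORECASE | re.DOTALL,
)

# Hypothetical sample inputs, modelled on the snippets in the question.
flat = '<p>a<span class="toujours_cacher">x</span>b</p>'
nested = ('<p>a<span class="toujours_cacher">*\n'
          '<span class="exposant" size="1">*</span> *</span>b</p>')

flat_result = pattern.sub("", flat)      # the whole span is removed
nested_result = pattern.sub("", nested)  # stops at the inner </span>

print(flat_result)    # <p>ab</p>
print(nested_result)  # <p>a *</span>b</p>
```

In the nested case the tail ` *</span>` is left behind, because the pattern has no way to count how many spans are currently open.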

Toby Speight

score 0 · Answer 2 · answered Jul 01 '15 at 14:35

Please do not use regex to parse HTML: HTML is not a regular language, so nested tags cannot be matched reliably. You could use BeautifulSoup instead. Here is an example of BeautifulSoup finding the tag <span class="toujours_cacher">(.)*?</span>.

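    from bs4 import BeautifulSoup

    # htmlCode holds the page source as a string; name the parser explicitly
    soup = BeautifulSoup(htmlCode, 'html.parser')
    # find_all is the current spelling of findAll
    spanTags = soup.find_all('span', attrs={'class': 'toujours_cacher'})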

This will return a list of all the span tags that have the class toujours_cacher.
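If installing BeautifulSoup is not an option (the asker mentions a standard-library-only constraint in the comments), the same idea can be approximated with `html.parser` from the standard library. The following is only a sketch, not part of the original answer: the names `HiddenSpanStripper` and `strip_hidden_spans` are hypothetical, and it assumes reasonably well-formed HTML in which every `<span>` is explicitly closed. It counts open spans so that nested spans inside the hidden one are dropped together with it.

```python
from html.parser import HTMLParser


class HiddenSpanStripper(HTMLParser):
    """Re-emit HTML, dropping <span class="toujours_cacher"> and its contents."""

    def __init__(self):
        # Keep entities such as &amp; untouched so they can be re-emitted.
        super().__init__(convert_charrefs=False)
        self.out = []   # pieces of HTML to keep
        self.depth = 0  # > 0 while inside a hidden span (counts nested spans)

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag == "span":
                self.depth += 1  # nested span inside the hidden one
            return
        classes = (dict(attrs).get("class") or "").split()
        if tag == "span" and "toujours_cacher" in classes:
            self.depth = 1  # start skipping
            return
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through unless hidden.
        if not self.depth:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.depth:
            if tag == "span":
                self.depth -= 1  # reaching 0 closes the hidden span itself
            return
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if not self.depth:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.depth:
            self.out.append("&%s;" % name)

    def handle_charref(self, name):
        if not self.depth:
            self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        if not self.depth:
            self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        if not self.depth:
            self.out.append("<!%s>" % decl)


def strip_hidden_spans(html):
    """Return html with every toujours_cacher span removed."""
    parser = HiddenSpanStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


nested = ('<p>a<span class="toujours_cacher">*\n'
          '<span class="exposant" size="1">*</span> *</span>b</p>')
print(strip_hidden_spans(nested))  # <p>ab</p>
```

End tags are rebuilt in lowercase and processing instructions are not copied, so the output is not guaranteed to be byte-identical to the input outside the removed spans.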

heinst

Thank you for your answer. I guess I will have to write a note on how to install BeautifulSoup – whitefret Jul 01 '15 at 14:42
