I want to write a Python script that looks for:

    <span class="toujours_cacher">(.)*?</span> 

I use this RE:

    r"(?i)\<span (\n|\t| )*?class=\"toujours_cacher\"(.|\n)*?\>(.|\n)*?\<\/span\>"

However, in some of my pages I found this kind of markup, with a span nested inside the one I am looking for:

    <span class="toujours_cacher">*
    <span class="exposant" size="1">*</span> *</span>

So I tried this RE:

    r"(?i)\<span (\n|\t| )*?class=\"toujours_cacher\"(.|\n)*?\>(.|\n)*?(\<\/span\>|\<\/span\>(.|\n)*?<\/span>)"

This does not work either: when there is no nested span in between, it keeps matching up to the next `</span>`.

I need to delete the content of every span with the class "toujours_cacher". Is there any way to do it with one RE?

I will be pleased to hear any of your suggestions :)

whitefret
  • If you're parsing HTML or XML, please don't try to use regex. Take a look at ETree, BeautifulSoup, or some other parsing library. – Morgan Thrapp Jul 01 '15 at 14:30
  • This does not do what you think it does: `(.|\n)*` . You need to learn about character classes, and where to put the parenthesis around capture groups. – le3th4x0rbot Jul 01 '15 at 14:32
  • I can only use 'standard' Python for this. It is a script for people who don't know anything about programming and who won't be able to install modules – whitefret Jul 01 '15 at 14:34
  • @whitefret Well then teach them to parse a `regular` text file or something using regular expressions. Not a webpage, because you are teaching them to use regex incorrectly – heinst Jul 01 '15 at 14:42

2 Answers

This is (provably) impossible with regular expressions - they cannot match delimiters to arbitrary depth. You'll need to move to using an actual parser instead.
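
Since the question is limited to the standard library, here is a minimal sketch of what "an actual parser" could look like using only `html.parser`, which ships with Python. The class name `HiddenSpanStripper` and the function `strip_hidden` are made up for this illustration. The sketch counts nested `<span>` tags so that it knows which `</span>` really closes the hidden one; it assumes the spans in the page are properly closed.

```python
from html.parser import HTMLParser


class HiddenSpanStripper(HTMLParser):
    """Copy HTML through, dropping what is inside span.toujours_cacher."""

    def __init__(self):
        # Keep entities such as &amp; untouched so they can be copied as-is.
        super().__init__(convert_charrefs=False)
        self.out = []    # pieces of the output document
        self.depth = 0   # > 0 while inside a hidden span

    def emit(self, text):
        # Only copy text when we are outside a hidden span.
        if not self.depth:
            self.out.append(text)

    def handle_starttag(self, tag, attrs):
        if self.depth:
            # Inside a hidden span: just count nested <span> tags.
            if tag == "span":
                self.depth += 1
            return
        self.out.append(self.get_starttag_text())
        classes = (dict(attrs).get("class") or "").split()
        if tag == "span" and "toujours_cacher" in classes:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            if tag == "span":
                self.depth -= 1
            if self.depth:
                return  # still inside the hidden span
        self.out.append("</%s>" % tag)

    def handle_startendtag(self, tag, attrs):
        self.emit(self.get_starttag_text())  # e.g. <br/>

    def handle_data(self, data):
        self.emit(data)

    def handle_entityref(self, name):
        self.emit("&%s;" % name)

    def handle_charref(self, name):
        self.emit("&#%s;" % name)

    def handle_comment(self, data):
        self.emit("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.emit("<!%s>" % decl)


def strip_hidden(html):
    """Return html with the content of every hidden span removed."""
    parser = HiddenSpanStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


page = ('<p>a <span class="toujours_cacher">*\n'
        '<span class="exposant" size="1">*</span> *</span> b</p>')
print(strip_hidden(page))
# prints: <p>a <span class="toujours_cacher"></span> b</p>
```

The depth counter is the part a regular expression cannot express. Note that the output is rebuilt from parser events, so end tags come out lowercased and unusual markup may not round-trip byte for byte.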

Toby Speight

Please do not use regex to parse HTML, because HTML is not a regular language. You could use BeautifulSoup. Here is an example of BeautifulSoup finding the tag `<span class="toujours_cacher">(.)*?</span>`.

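    from bs4 import BeautifulSoup

    # htmlCode is assumed to be a string holding the page source
    soup = BeautifulSoup(htmlCode, 'html.parser')
    # find_all is the current name of findAll in bs4
    spanTags = soup.find_all('span', attrs={'class': 'toujours_cacher'})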

This will return a list of all the span tags that have the class toujours_cacher.

heinst