
I'm starting to learn the re module. First, here is my original code:

import re
cheesetext = u'''<tag>I love cheese.</tag>
<tag>Yeah, cheese is all I need.</tag>
<tag>But let me explain one thing.</tag>
<tag>Cheese is REALLY I need.</tag>
<tag>And the last thing I'd like to say...</tag>
<tag>Everyone can like cheese.</tag>
<tag>It's a question of the time, I think.</tag>'''

def action1(source):
  # Extract the text between every <tag>...</tag> pair.
  regex = u'<tag>(.*?)</tag>'
  pattern = re.compile(regex, re.UNICODE | re.DOTALL | re.IGNORECASE)
  return pattern.findall(source)

def action2(match, source):
  # Return True if `match` occurs anywhere in `source`.
  pattern = re.compile(match, re.UNICODE | re.DOTALL | re.IGNORECASE)
  return pattern.search(source) is not None

result = action1(cheesetext)
result = [item for item in result if action2(u'cheese', item)]
print result
>>> [u'I love cheese.', u'Yeah, cheese is all I need.', u'Cheese is REALLY I need.', u'Everyone can like cheese.']

Now, here is what I need: I want to do the same thing with a single regex. This is only an example; I have to process much more data than these cheesy texts. :-) Is it possible to combine these two actions into one regex? In other words, how can I use conditions in a regex?

ghostmansd
  • By the way, it looks like you're trying to parse SGML/HTML/XML using regular expressions. That's not always the best way to go: regular expressions treat everything as a flat string, while markup languages describe a tree. Whatever you do, do _not_ try to escape HTML using regular expressions, or [samy will be your hero](http://namb.la/popular/tech.html). – cha0site Feb 08 '12 at 09:25

3 Answers

>>> p = u'<tag>((?:(?!</tag>).)*cheese.*?)</tag>'
>>> patt = re.compile(p, re.UNICODE | re.DOTALL | re.IGNORECASE)
>>> patt.findall(cheesetext)
[u'I love cheese.', u'Yeah, cheese is all I need.', u'Cheese is REALLY I need.', u'Everyone can like cheese.']

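This uses a negative lookahead assertion: `(?:(?!</tag>).)*` consumes one character at a time, but only while that character does not begin `</tag>`, so the match cannot run past the closing tag before it finds "cheese". A good explanation of this is given by Tim Pietzcker in this question.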
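To see why the lookahead matters, here is a minimal sketch that compares the pattern with a naive one. The names `sample`, `naive`, and `tempered` are made up for illustration, and the two-tag input is a hypothetical reduction of the question's text. The code is written to run on both Python 2 and Python 3.

```python
import re

# Hypothetical sample: the first tag has no "cheese", the second does.
sample = u"<tag>But let me explain.</tag>\n<tag>I love cheese.</tag>"
flags = re.UNICODE | re.DOTALL | re.IGNORECASE

# Naive: with DOTALL, .*? may run across </tag> to reach "cheese".
naive = re.compile(u"<tag>(.*?cheese.*?)</tag>", flags)

# Tempered: no character before "cheese" may start "</tag>".
tempered = re.compile(u"<tag>((?:(?!</tag>).)*cheese.*?)</tag>", flags)

naive_result = naive.findall(sample)
tempered_result = tempered.findall(sample)
print(naive_result)     # one match that swallows both tags
print(tempered_result)  # only the body of the second tag
```

The naive pattern starts at the first `<tag>` and keeps expanding until it reaches "cheese" in the second tag, so its single capture spans both tags. The lookahead version fails on the first tag and matches only the second.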

beerbajay
  • you need the negative-lookahead on both sides of "cheese" – ptitpoulpe Feb 08 '12 at 10:09
  • Why? You're already using a reluctant `.*?`, so the match will stop at `</tag>` anyway. – beerbajay Feb 08 '12 at 10:27
  • Ha, no problem. I'm pretty sure your version also works, it just does some unnecessary computation. – beerbajay Feb 08 '12 at 10:39
  • @beerbajay: Thanks, I think it's the best answer! One question: can I add two more conditions here, so that a text is included only if "cheese" is not part of 'BARcheeseFOO' or 'FOOcheeseBAR'? I don't understand where I should insert the condition. – ghostmansd Feb 08 '12 at 19:14
  • The more conditions you have, the more difficult to read the regex becomes. You **can** have these conditions, but it's almost easier to do the analysis in several steps. Also, what about this case: `I love cheese, but hate BARcheeseFOO`? – beerbajay Feb 09 '12 at 08:17
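The several-step approach suggested in the last comment could look roughly like the sketch below. It assumes one possible reading of the extra condition: "cheese" counts only as a whole word, which `\bcheese\b` expresses. The names `TAG`, `WORD`, `tags_with_word`, and `sample` are hypothetical, and the code runs on both Python 2 and Python 3.

```python
import re

FLAGS = re.UNICODE | re.DOTALL | re.IGNORECASE
TAG = re.compile(r"<tag>(.*?)</tag>", FLAGS)
# Assumption: "cheese" must be a whole word, so BARcheeseFOO is ignored.
WORD = re.compile(r"\bcheese\b", FLAGS)

def tags_with_word(source):
    # Step 1: extract every tag body.
    # Step 2: keep the bodies that contain a standalone "cheese".
    return [body for body in TAG.findall(source) if WORD.search(body)]

sample = (u"<tag>I love cheese, but hate BARcheeseFOO</tag>"
          u"<tag>Only BARcheeseFOO here.</tag>"
          u"<tag>Cheese!</tag>")
kept = tags_with_word(sample)
print(kept)
```

Under this reading, a text such as `I love cheese, but hate BARcheeseFOO` is kept because it contains at least one standalone "cheese". A stricter rule would need a different second step.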

You can use `|` for alternation:

>>> import re
>>> m = re.compile("(Hello|Goodbye) World")
>>> m.match("Hello World")
<_sre.SRE_Match object at 0x01ECF960>
>>> m.match("Goodbye World")
<_sre.SRE_Match object at 0x01ECF9E0>
>>> m.match("foobar")
>>> m.match("Hello World").groups()
('Hello',)

In addition, if you need actual conditions, you can use lookahead assertions, (?=...) and (?!...), backreferences to previously matched groups such as (?P=name), and friends. See Python's re module docs.
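Python's re module also has a true conditional construct, `(?(id/name)yes|no)`, which picks a sub-pattern depending on whether an earlier group took part in the match. The sketch below adapts the e-mail pattern shown in the re documentation; the names `EMAIL` and `is_balanced` are made up for illustration.

```python
import re

# Group 1 optionally captures "<". The conditional (?(1)>|$) then requires
# ">" if group 1 matched, and the end of the string otherwise.
EMAIL = re.compile(r"(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)")

def is_balanced(text):
    # Hypothetical helper: True when the angle brackets around the address agree.
    return EMAIL.match(text) is not None

print(is_balanced("<user@host.com>"))  # True
print(is_balanced("user@host.com"))    # True
print(is_balanced("<user@host.com"))   # False
```

This kind of conditional depends on a capture group rather than on arbitrary text, so it fits cases like matching delimiters better than the "contains cheese" filter in the question.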

cha0site

I suggest using a negative lookahead on both sides of "cheese" to make sure no </tag> ends up inside the match:

re.findall(r'<tag>((?:(?!</tag>).)*?cheese(?:(?!</tag>).)*?)</tag>', cheesetext, re.DOTALL | re.IGNORECASE)
ptitpoulpe