is there a way for python to use values stored in variables as patterns in regex?

supposing i have two variables:

begin_tag = '<%marker>'
end_tag = '<%marker/>'

doc = '<html> something here <%marker> and here and here <%marker/> and more here <html>'

how do you extract the text between begin_tag and end_tag?

the tags are determined after parsing another file, so they're not fixed.

momo
  • If you have to ask this, you don't get something very fundamental. Yes, no matter if you have a variable containing `foo` or hardcode `foo`, you can use both the same way. But apart from that, obligatory comment to `/reg(ular )?ex(pression)?.*html/i`: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 –  Nov 20 '10 at 18:25
  • are you sure you want `<` tag `>` followed by `<` tag `/>` and not by `</` tag `>`? – SingleNegationElimination Nov 20 '10 at 18:41
  • it doesn't really matter, the tag is a custom one, and i just needed some way of marking the end of a section of text. – momo Nov 20 '10 at 21:45

2 Answers

Don't use a regex at all. Parse HTML intelligently!

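# Python 3 with the beautifulsoup4 package (the successor to the old BeautifulSoup module)
from bs4 import BeautifulSoup

marker = 'mytag'
doc = '<html>some stuff <mytag> different stuff </mytag> other things </html>'
# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(doc, 'html.parser')
# decode_contents() returns the tag's inner markup as a string
print(soup.find(marker).decode_contents())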
SingleNegationElimination

Regular expression patterns are ordinary strings, so you can build them the same way you build any other string: concatenation (the + operator), interpolation (the % operator), etc. Just concatenate the variables you want to match with the regex you want to use:

begin_tag + ".*?" + end_tag

The only pitfall is when your variables contain characters that might be taken by the regular expression engine to have special meaning. You need to make sure they are escaped properly in that case. You can do this with the re.escape() function.
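
Putting the two points together, here is a minimal sketch of how the extraction from the question might look. The helper name `between` is made up for illustration, and the use of re.DOTALL assumes the marked section may span several lines; adjust to taste.

```python
import re

begin_tag = '<%marker>'
end_tag = '<%marker/>'
doc = '<html> something here <%marker> and here and here <%marker/> and more here <html>'


def between(text, begin, end):
    """Hypothetical helper: text between the first `begin` and the next `end`, or None."""
    # re.escape() neutralises any regex metacharacters inside the tags.
    # The parentheses capture only the text between the two tags.
    pattern = re.escape(begin) + r'(.*?)' + re.escape(end)
    # re.DOTALL lets '.' match newlines, so the section may span lines.
    match = re.search(pattern, text, re.DOTALL)
    return match.group(1) if match else None


print(repr(between(doc, begin_tag, end_tag)))
```

The non-greedy `.*?` stops at the first end tag after the begin tag; if a document can contain several marked sections, re.findall with the same pattern would return all of them.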

The usual caveat ("don't parse HTML with regular expressions") applies.

kindall
  • I bet that a safer choice would be to use `re.escape(begin_tag) + ".*?" + re.escape(end_tag)`. – tzot Nov 20 '10 at 19:08
  • i've heard that it's not a good idea to parse HTML with regexp, but besides using certain libraries to parse for you, what other option is there? i'm not aware of a way to modify python's "method_missing" method, which would allow the creation of a DSL that would be able to handle this. ironically, even though my experience in ruby or io is extremely limited (a few days), i can write something in those languages that can handle this specific case that i can't do in python. – momo Nov 20 '10 at 21:52