using stored variables as regex patterns

Question

Is there a way in Python to use values stored in variables as regex patterns?

supposing i have two variables:

begin_tag = '<%marker>'
end_tag = '<%marker/>'

doc = '<html> something here <%marker> and here and here <%marker/> and more here <html>'

how do you extract the text between begin_tag and end_tag?

the tags are determined after parsing another file, so they're not fixed.

If you have to ask this, you don't get something very fundamental. Yes, no matter if you have a variable containing `foo` or hardcode `foo`, you can use both the same way. But apart from that, obligatory comment to `/reg(ular )?ex(pression)?.*html/i`: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 — , Nov 20 '10 at 18:25
are you sure you want `<` tag `>` followed by `<` tag `/>` and not by `</` tag `>`? — SingleNegationElimination, Nov 20 '10 at 18:41
it doesn't really matter, the tag is a custom one, and i just needed some way of marking the end of a section of text. — momo, Nov 20 '10 at 21:45

SingleNegationElimination · Answer 1 · 2010-11-21T19:21:14.463


Don't use a regex at all. Parse HTML intelligently!

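# Assumes BeautifulSoup 4 (the bs4 package); the BeautifulSoup 3 import only works on Python 2.
from bs4 import BeautifulSoup

marker = 'mytag'
doc = '<html>some stuff <mytag> different stuff </mytag> other things </html>'
soup = BeautifulSoup(doc, 'html.parser')
# Print the markup inside the first <mytag> element.
print(soup.find(marker).decode_contents())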

edited Nov 21 '10 at 19:21

answered Nov 20 '10 at 18:48

SingleNegationElimination


i can't make BeautifulSoup parse custom tags, which is what i'm trying to do. – momo Nov 20 '10 at 21:44

... Um. Why would you want a custom tag in HTML? Do you really need XML? Should you be using a different template language for this task? – SingleNegationElimination Nov 21 '10 at 19:16

score 1 · Answer 2 · answered Nov 20 '10 at 18:29


In Python, regular expression patterns are ordinary strings, so you can build them the same way you build any other string: concatenation (the + operator), interpolation (the % operator), and so on. Just concatenate the variables you want to match with the regex you want between them:

begin_tag + ".*?" + end_tag

The only pitfall is when your variables contain characters that might be taken by the regular expression engine to have special meaning. You need to make sure they are escaped properly in that case. You can do this with the re.escape() function.

The usual caveat ("don't parse HTML with regular expressions") applies.
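
As a rough illustration of the escaping advice above, here is a minimal sketch that builds the pattern from the two variables in the question and pulls out the text between them. The helper name `extract_between`, the added capture group, and the `re.DOTALL` flag are my own assumptions, not part of the original answer; only `re.escape()` and the concatenation approach come from it.

```python
import re

def extract_between(text, begin_tag, end_tag):
    """Return the text between the first begin_tag and the next end_tag, or None."""
    # re.escape() makes any regex metacharacters in the tags match literally.
    pattern = re.escape(begin_tag) + "(.*?)" + re.escape(end_tag)
    # re.DOTALL lets "." match newlines, in case the marked section spans lines
    # (an assumption; drop the flag if sections never cross a line break).
    match = re.search(pattern, text, re.DOTALL)
    return match.group(1) if match else None

# Values taken from the question.
begin_tag = '<%marker>'
end_tag = '<%marker/>'
doc = '<html> something here <%marker> and here and here <%marker/> and more here <html>'

print(repr(extract_between(doc, begin_tag, end_tag)))
```

The non-greedy `.*?` stops at the first end tag after the begin tag; if a document can contain several marked sections, `re.findall` with the same pattern would return all of them.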


kindall



I bet that a safer choice would be to use `re.escape(begin_tag) + ".*?" + re.escape(end_tag)`. – tzot Nov 20 '10 at 19:08
i've heard that it's not a good idea to parse HTML with regexp, but besides using certain libraries to parse for you, what other option is there? i'm not aware of a way to modify python's "method_missing" method, which would allow the creation of a DSL that would be able to handle this. ironically, even though my experience in ruby or io is extremely limited (a few days), i can write something in those languages that can handle this specific case that i can't do in python. – momo Nov 20 '10 at 21:52
