Regex to extract between two strings (which are variables)

Question

I am looking to use regex to extract text that occurs between two strings. I know how to do this when the two strings are the same every time (there are countless questions asking for this, e.g. Regex matching between two strings?), but I want to do it using variables that change and may themselves contain regex special characters. (I want any special characters, e.g. *, treated as literal text.)

For example, if I had:

text = "<b*>Test</b>"
left_identifier = "<b*>"
right_identifier = "</b>"

I would want to create regex code that results in the following being run:

re.findall(r'<b\*>(.*)<\/b>', text)

It is the <b\*>(.*)<\/b> part that I don't know how to dynamically create.

You may want to consider a non-greedy quantifier: `(.*?)` matches as few characters as possible. So in the case of a string like "{left_identifier}stuff{right_identifier} {left_identifier}more stuff{right_identifier}", you'll find only "stuff" and "more stuff" in two separate matches instead of "stuff{right_identifier} {left_identifier}more stuff" in one match. — Shashank, Apr 15 '15 at 17:20
Thanks, good spot. You are right: the non-greedy quantifier is what I meant! — kyrenia, Apr 15 '15 at 17:36
Please note that using regex to parse HTML [is not recommended](http://stackoverflow.com/a/1732454/405017). You should use an HTML parser (whatever Python's equivalent of [Nokogiri](http://nokogiri.org) is) and then extract text from the appropriate tag. — Phrogz, Apr 15 '15 at 19:00
@Phrogz - the example was simplified. I am not parsing based on HTML tags in general (although I need to be able to cope with them, as they do crop up in the text I am inputting). [For reference, BeautifulSoup is the equivalent HTML parser in Python.] — kyrenia, Apr 15 '15 at 21:43

score 7 · Answer 1 · answered Apr 15 '15 at 17:14

You can do something like this:

import re
pattern_string = re.escape(left_identifier) + "(.*?)" + re.escape(right_identifier)
pattern = re.compile(pattern_string)

The escape function will automatically escape special characters. For example:

>>> import re
>>> print(re.escape("<b*>"))  # Python 3.7+; older versions also escape < and >
<b\*>
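Putting the two pieces together, a small helper might look like the sketch below. The function name `extract_between` is hypothetical (it is not part of `re`), and passing `re.DOTALL` is an assumption that the enclosed text may span several lines; drop the flag if matches should stay on one line.

```python
import re

def extract_between(text, left, right):
    """Return every substring of text found between left and right."""
    # re.escape makes both delimiters match literally, so "*" is not a quantifier.
    pattern = re.compile(re.escape(left) + "(.*?)" + re.escape(right), re.DOTALL)
    return pattern.findall(text)

print(extract_between("<b*>Test</b>", "<b*>", "</b>"))  # ['Test']
```

Because the delimiters are escaped inside the helper, callers can pass arbitrary strings without worrying about regex metacharacters.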

Alexandru Chirila

Please also note the `(.*?)` instead of `(.*)`: it is a non-greedy capture, which is likely what you want here. – Alexandru Chirila Apr 15 '15 at 17:17
I tried performing the above with "PRIMARY KEY(\n" as the left identifier and ")" as the right identifier but didn't work for me. I wanted to get all the primary keys from the below: PRIMARY KEY (ROLE_ID) USING INDEX APP_ROLES.SR_PK ENABLE VALIDATE); – RB17 Aug 30 '19 at 12:45

score 5 · Accepted Answer · answered Apr 15 '15 at 17:14

You need to re.escape the identifiers:

>>> regex = re.compile('{}(.*){}'.format(re.escape('<b*>'), re.escape('</b>')))
>>> regex.findall('<b*>Text</b>')
['Text']
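The difference between `(.*)` and the non-greedy `(.*?)` mentioned in the comments only shows up when the delimiters occur more than once. A minimal illustration, where the sample `text` is made up for the purpose:

```python
import re

left, right = "<b*>", "</b>"
text = "<b*>one</b> and <b*>two</b>"  # made-up input with two delimited spans

# Greedy: .* runs to the LAST right delimiter in the string.
greedy = re.compile("{}(.*){}".format(re.escape(left), re.escape(right)))
# Non-greedy: .*? stops at the FIRST right delimiter after each left one.
lazy = re.compile("{}(.*?){}".format(re.escape(left), re.escape(right)))

print(greedy.findall(text))  # ['one</b> and <b*>two']
print(lazy.findall(text))    # ['one', 'two']
```

With a single pair of delimiters, as in the question's example, both patterns return the same result.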

agf

dawg · Answer 3 · 2015-04-15T17:47:56.650

The regex starts its life as an ordinary string, so you can build it by concatenation, e.g. left_identifier + '(.*)' + right_identifier, and pass the result to re.compile.

Or:

re.findall('{}(.*){}'.format(left_identifier, right_identifier), text)

works too.

You need to escape the strings in the variables with re.escape if they contain regex metacharacters and you do not want those metacharacters interpreted as such:

>>> import re
>>> text = "<b*>Test</b>"
>>> left_identifier = "<b*>"
>>> right_identifier = "</b>"
>>> s = '{}(.*?){}'.format(*map(re.escape, (left_identifier, right_identifier)))
>>> s  # Python 3.7+ output; older versions also escape <, > and /
'<b\\*>(.*?)</b>'
>>> re.findall(s, text)
['Test']

On a side note, str.partition(var) is an alternative way to do this:

>>> text.partition(left_identifier)[2].partition(right_identifier)[0]
'Test'
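One caveat with the chained `partition` call is that it quietly returns text even when a delimiter is missing. A slightly more defensive sketch follows; the name `between_first` is hypothetical, and it assumes both delimiters are non-empty strings (`str.partition` raises `ValueError` on an empty separator):

```python
def between_first(text, left, right):
    """Return the text between the first left and the next right, or None."""
    # partition returns (before, separator, after); separator is "" if not found.
    _, found_left, rest = text.partition(left)
    if not found_left:
        return None
    middle, found_right, _ = rest.partition(right)
    return middle if found_right else None

print(between_first("<b*>Test</b>", "<b*>", "</b>"))  # Test
```

Unlike `re.findall`, this returns only the first delimited span, which may or may not be what is wanted.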

score 0 · Answer 4 · edited May 23 '17 at 11:59

I know you actually wanted a regex solution, but I really wonder whether regex is the right tool here, considering we have all taken an oath not to parse HTML with it. When parsing HTML strings, I will always recommend falling back to BeautifulSoup:

>>> import bs4
>>> bs4.BeautifulSoup('<b*>Text</b>').text
u'Text'

answered Apr 15 '15 at 19:07

Abhijit
