I want to use a regex to extract the text that occurs between two strings. I know how to do this when the delimiting strings are the same every time (and countless questions already cover that, e.g. "Regex matching between two strings?"), but I want to do it with variables that change and that may themselves contain characters that are special in regex. (I want any special characters, e.g. *, treated as literal text.)

For example, if I had:

text = "<b*>Test</b>"
left_identifier = "<b*>"
right_identifier = "</b>"

I want to build the regex dynamically so that the following code is effectively run:

re.findall('<b\*>(.*)<\/b>',text)

It is the <b\*>(.*)<\/b> part that I don't know how to dynamically create.

kyrenia
  • You may want to consider a non-greedy quantifier: `(.*?)` matches as few characters as possible. So for a string like "{left_identifier}stuff{right_identifier} {left_identifier}more stuff{right_identifier}", you'll find only "stuff" and "more stuff" in two separate matches instead of "stuff{right_identifier} {left_identifier}more stuff" in one match. – Shashank Apr 15 '15 at 17:20
  • Thanks, good spot. You are right: the non-greedy quantifier was what I meant! – kyrenia Apr 15 '15 at 17:36
  • Please note that using regex to parse HTML [is not recommended](http://stackoverflow.com/a/1732454/405017). You should use an HTML parser (whatever Python's equivalent of [Nokogiri](http://nokogiri.org) is) and then extract text from the appropriate tag. – Phrogz Apr 15 '15 at 19:00
  • @Phrogz: the example was simplified. I am not parsing based on HTML tags in general (although I need to be able to cope with them, as they do crop up in the text I am inputting). For reference, BeautifulSoup is the equivalent HTML parser in Python. – kyrenia Apr 15 '15 at 21:43

4 Answers

You can do something like this:

import re
pattern_string = re.escape(left_identifier) + "(.*?)" + re.escape(right_identifier)
pattern = re.compile(pattern_string)

The escape function will automatically escape special characters. For example (output from Python 3.7+; older versions also escape < and >):

>>> import re
>>> print(re.escape("<b*>"))
<b\*>
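
To tie the two snippets above together, here is a minimal sketch of a reusable helper. The name `extract_between` and the use of `re.DOTALL` are my own assumptions, not part of the answer; drop the flag if the captured text should never span lines.

```python
import re

def extract_between(text, left_identifier, right_identifier):
    """Return every non-greedy match found between the two literal delimiters."""
    # re.escape makes regex metacharacters such as * match literally.
    pattern = re.compile(
        re.escape(left_identifier) + "(.*?)" + re.escape(right_identifier),
        re.DOTALL,  # assumption: the captured text may span several lines
    )
    return pattern.findall(text)

print(extract_between("<b*>Test</b>", "<b*>", "</b>"))       # ['Test']
print(extract_between("[x]a[/x] [x]b[/x]", "[x]", "[/x]"))   # ['a', 'b']
```

Because the delimiters are escaped, a left identifier such as "a." only matches a literal "a." and not "ax".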
Alexandru Chirila
  • Please also note the `(.*?)` instead of `(.*)`: it makes the capture non-greedy, which is likely what you want here. – Alexandru Chirila Apr 15 '15 at 17:17
  • I tried the above with "PRIMARY KEY(\n" as the left identifier and ")" as the right identifier, but it didn't work for me. I wanted to get all the primary keys from the following: PRIMARY KEY (ROLE_ID) USING INDEX APP_ROLES.SR_PK ENABLE VALIDATE); – RB17 Aug 30 '19 at 12:45

You need to re.escape the identifiers:

>>> import re
>>> regex = re.compile('{}(.*){}'.format(re.escape('<b*>'), re.escape('</b>')))
>>> regex.findall('<b*>Text</b>')
['Text']
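
Note that the pattern above uses the greedy `(.*)`. As a rough illustration of why the comments on the question suggest `(.*?)`, here is a small comparison on a made-up input with two delimited spans (the sample text is my own, not from the question):

```python
import re

left, right = re.escape("<b*>"), re.escape("</b>")
text = "<b*>one</b> and <b*>two</b>"

# Greedy: .* runs to the last right identifier in the string.
greedy = re.findall(left + "(.*)" + right, text)
# Non-greedy: .*? stops at the first right identifier after each left one.
lazy = re.findall(left + "(.*?)" + right, text)

print(greedy)  # ['one</b> and <b*>two']
print(lazy)    # ['one', 'two']
```

With a single delimited span, as in the question, both forms give the same result.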
agf

A regex starts its life as an ordinary string, so you can build it by concatenation, e.g. left_identifier + "(.*?)" + right_identifier, and pass the result to re.compile.

Or:

re.findall('{}(.*){}'.format(left_identifier, right_identifier), text)

works too.

If the strings in the variables contain regex metacharacters and you do not want them interpreted as such, you need to escape them with re.escape:

>>> import re
>>> text = "<b*>Test</b>"
>>> left_identifier = "<b*>"
>>> right_identifier = "</b>"
>>> s = '{}(.*?){}'.format(*map(re.escape, (left_identifier, right_identifier)))
>>> s  # Python 3.7+; older versions also escape <, > and /
'<b\\*>(.*?)</b>'
>>> re.findall(s, text)
['Test']

On a side note, str.partition(var) is an alternative way to do this without a regex:

>>> text.partition(left_identifier)[2].partition(right_identifier)[0]
'Test'
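
One caveat with the chained one-liner: if the left identifier is missing it returns an empty string, and if the right identifier is missing it returns everything after the left one. A minimal sketch of a stricter variant follows; the helper name `between_first` and the choice to return None are my own assumptions.

```python
def between_first(text, left_identifier, right_identifier):
    """Return the text between the first left/right pair, or None if either is missing."""
    # partition returns (before, separator, after); separator is '' when not found.
    _, found_left, rest = text.partition(left_identifier)
    if not found_left:
        return None
    middle, found_right, _ = rest.partition(right_identifier)
    return middle if found_right else None

print(between_first("<b*>Test</b>", "<b*>", "</b>"))  # Test
print(between_first("no tags here", "<b*>", "</b>"))  # None
```

Like the one-liner, this only finds the first occurrence; for every occurrence, the re.findall approach is the simpler fit.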
dawg

I know you asked for a regex solution, but I wonder whether regex is the right tool here, considering we have all taken an oath not to parse HTML with it. When parsing HTML strings, I always recommend falling back on BeautifulSoup:

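>>> import bs4
>>> # Python 3; naming the parser explicitly avoids bs4's "no parser specified" warning
>>> bs4.BeautifulSoup('<b*>Text</b>', 'html.parser').text
'Text'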
Abhijit