6

I'm new to Python and still learning regular expressions, so this question may seem trivial to a regex expert, but here goes. I suppose it is a generalization of this question about finding a string between two strings. What if the pattern (initial_substring + substring_to_find + end_substring) is repeated many times in a long string? For example:

import re

test = 'someth1 var="this" someth2 var="that" '
result = re.search('var=(.*) ', test)
print(result.group(1))
# prints: "this" someth2 var="that"

Instead, I'd like to get a list like ["this","that"]. How can I do it?

Nonancourt

2 Answers

10

Use re.findall():

result = re.findall(r'var="(.*?)"', test)
print(result)  # ['this', 'that']

If a quoted value can itself span multiple lines, add the re.DOTALL flag so that . also matches newline characters:

re.findall(r'var="(.*?)"', test, re.DOTALL)
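
As a quick check of the difference the flag makes, the sketch below runs the same pattern with and without re.DOTALL. The multiline_test string is made up for illustration; only re.findall and re.DOTALL come from the answer itself.

```python
import re

# Hypothetical input: the first quoted value contains a newline.
multiline_test = 'someth1 var="this \n then" someth2 var="that" '

pattern = r'var="(.*?)"'

# Without re.DOTALL, "." does not match "\n", so the first value is skipped.
without_flag = re.findall(pattern, multiline_test)

# With re.DOTALL, "." also matches "\n", so both values are captured.
with_flag = re.findall(pattern, multiline_test, re.DOTALL)

print(without_flag)
print(with_flag)
```

Without the flag only the second value should come back, because the lazy group cannot cross the line break to reach the first closing quote.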
Alex Fine
zwer
  • This solution does not work if the string contains `\n`. How would this answer be adapted to support: test = 'someth1 var="this \n then" someth2 var="that" ' – Alex Fine Jan 04 '21 at 02:10
  • @AlexFine if you need it to work over multiple lines, you need to set the [`re.DOTALL`](https://docs.python.org/3/library/re.html#re.DOTALL) flag when doing your matching so that a dot matches new lines. You can pass the flag explicitly as: `re.findall(r'var="(.*?)"', test, re.DOTALL)`, or use in-line syntax within the pattern: `re.findall(r'(?s)var="(.*?)"', test)`. – zwer Jan 08 '21 at 10:37
1

The problem with your current regex is that the capture group (.*) is greedy: it matches as many characters as it can. After the first var= in your string, the group consumes everything up to the last space, and re.search returns only that single match.

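If you instead narrow the expression to var="([\w\s]+)", which allows only word characters and whitespace between the quotes, the match can no longer run past a closing quote. Combined with re.findall to collect every match, that line of Python becomes: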

result = re.findall(r'var="([\w\s]+)"', test)
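
To see where this character class helps and where it falls short, here is a small sketch on an invented sample string. The negated class [^"]* is not part of this answer; it is shown only as a common alternative for readers who want to avoid . while still accepting punctuation.

```python
import re

# Hypothetical input with a space-separated value and a punctuated value.
sample = 'a var="foo bar" b var="x-1" c var="that" '

# [\w\s]+ accepts only word characters and whitespace,
# so the value containing "-" is not matched.
word_space = re.findall(r'var="([\w\s]+)"', sample)

# [^"]* accepts any character except a double quote (assumed alternative).
not_quote = re.findall(r'var="([^"]*)"', sample)

print(word_space)
print(not_quote)
```

The first call should skip the punctuated value, while the second should return all three.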
m_callens
  • That will fail if the input string contains `var="foo bar"` (or any non-word character for that matter) under the assumption that he wants to extract everything between the quote marks. – zwer Feb 17 '17 at 16:20
  • @zwer yes, that may be true, but if the words within the quotes are being used as variables as per the `var=` prefix (an assumption that is probably not best to be made without OP specifying), the contents will never have a space – m_callens Feb 17 '17 at 16:22
  • `\w` will capture numbers as well, and `3this` is not a valid variable name either. – zwer Feb 17 '17 at 16:27
  • Thanks for the specification, @zwer. Yes, in fact, I'd be interested in the general case when it could be `var="foo bar"`. – Nonancourt Feb 17 '17 at 16:27
  • @Nonancourt ok, I'll make the revision now. – m_callens Feb 17 '17 at 16:27
  • @zwer 's answer is just as appropriate, I'm simply not a proponent of using `.` because of greed in expressions – m_callens Feb 17 '17 at 16:30