2

I am trying to create a list of tuples from the data that follows the strings string1 and string3, but I am not getting the expected result.

import re
s = 'string1:1234string2string3:a1b2c3string1:2345string3:b5c6d7'
re.findall(r'string1:(\d+)[\s,\S]+string3:([\s\S]+)', s)

Actual result:

[('1234', 'b5c6d7')]

Expected result:

[('1234', 'a1b2c3'), ('2345', 'b5c6d7')]
Martijn Pieters
  • Possible duplicate of [My regular expression matches too much. How can I tell it to match the smallest possible pattern?](https://stackoverflow.com/questions/7014903/my-regular-expression-matches-too-much-how-can-i-tell-it-to-match-the-smallest) – Sebastian Proske Jul 16 '18 at 07:32
  • A somewhat far-fetched duplicate. Related, yes, but not an exact duplicate, since it takes some more work to make that approach work here. – Jean-François Fabre Jul 16 '18 at 07:41

2 Answers

3

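Your current regex uses [\s,\S]+, which is greedy: it consumes everything up to the end of the string and then backtracks only as far as the last string3. (The comma inside the character class is redundant, since \S already matches it.)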
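To see the greedy behaviour concretely, here is a minimal sketch on the question's sample string. The variable names are illustrative only, and the lazy variant uses `*?` rather than `+?` on the assumption that nothing at all may sit between the digits and string3 (as in the second record).

```python
import re

s = 'string1:1234string2string3:a1b2c3string1:2345string3:b5c6d7'

# Greedy: [\s\S]+ runs to the end of the string, then backtracks only as far
# as the LAST 'string3:', so both records collapse into a single match.
greedy = re.findall(r'string1:(\d+)[\s\S]+string3:([\s\S]+)', s)

# Lazy: [\s\S]*? expands one character at a time and stops at the FIRST
# 'string3:' that follows the digits, so each record is matched separately.
lazy = re.findall(r'string1:(\d+)[\s\S]*?string3:', s)

print(greedy)  # [('1234', 'b5c6d7')]
print(lazy)    # ['1234', '2345']
```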

You could make it non-greedy and add a positive lookahead (?=string|$) after the last group, which asserts that what follows is either string or the end of the line ($).

string1:(\d+).*?string3:(.*?)(?=string|$)

import re
s = 'string1:1234string2string3:a1b2c3string1:2345string3:b5c6d7'
# Raw string so that \d reaches the regex engine unchanged
print(re.findall(r'string1:(\d+).*?string3:(.*?)(?=string|$)', s))
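The lookahead matters because a lazy group at the very end of a pattern is satisfied by matching nothing. A small sketch comparing the two forms on the same sample string (the variable names are made up for this illustration):

```python
import re

s = 'string1:1234string2string3:a1b2c3string1:2345string3:b5c6d7'

# Without the lookahead, the trailing lazy group (.*?) stops immediately
# and captures an empty string for every record.
without_lookahead = re.findall(r'string1:(\d+).*?string3:(.*?)', s)

# With the lookahead, the lazy group must keep expanding until the next
# 'string' (or the end of the input) is directly ahead.
with_lookahead = re.findall(r'string1:(\d+).*?string3:(.*?)(?=string|$)', s)

print(without_lookahead)  # [('1234', ''), ('2345', '')]
print(with_lookahead)     # [('1234', 'a1b2c3'), ('2345', 'b5c6d7')]
```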


The fourth bird
0

The problem is that [\s,\S]+ is greedy and therefore consumes everything between the first string1 and the last string3.

You can fix that by making the quantifiers non-greedy and adding a positive lookahead, like this:

string1:(\d+)[\s\S]*?string3:([\s\S]+?)(?=string|$)
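Since [\s\S] also matches newlines, the same idea should carry over to input that spans several lines. The sketch below assumes a hypothetical multi-line variant of the sample data, and the `\s*` before the lookahead is an addition of this example (not part of the pattern above) so that trailing newlines stay out of the captured value.

```python
import re

# Hypothetical multi-line version of the question's data (an assumption).
text = 'string1:1234\nstring2\nstring3:a1b2c3\nstring1:2345\nstring3:b5c6d7'

pattern = re.compile(
    r'string1:(\d+)'      # digits after string1
    r'[\s\S]*?'           # anything, including newlines, as little as possible
    r'string3:([\s\S]+?)' # value after string3, lazily
    r'\s*(?=string|$)'    # skip trailing whitespace, stop before next record
)

pairs = pattern.findall(text)
print(pairs)  # [('1234', 'a1b2c3'), ('2345', 'b5c6d7')]
```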
Teekeks