import re
str='filename=1817616353&realname=Arguments%20for%20&%20against%20protection%20.pdf&code2=pds'
ptn='(?<=realname=).+(?=&)'
re.search(ptn,str).group()

When I run this code, I expect to get

'Arguments%20for%20'

as the match, but instead it gives me

'Arguments%20for%20&%20against%20protection%20.pdf'

I thought the match would end at the first occurrence of '&', which comes right after the 'for%20' part, so I don't understand why it runs all the way to '.pdf'. What am I doing wrong?

Mazdak

2 Answers


Your assumption that the first occurrence of & would match is fundamentally wrong.

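.+ means match as many as possible of any character (except newline). The regex engine therefore consumes as much as it can and only backs off far enough for the rest of the pattern to succeed, so whatever follows .+ is matched at the last possible position.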
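As a small illustration of that behaviour, the sketch below runs the pattern from the question and prints what is left of the string after the match. The variable name `s` is my own choice (used instead of `str` to avoid shadowing the built-in); everything else comes from the question.

```python
import re

# The query string from the question, stored as `s` rather than `str`.
s = 'filename=1817616353&realname=Arguments%20for%20&%20against%20protection%20.pdf&code2=pds'

greedy = re.search(r'(?<=realname=).+(?=&)', s)

# .+ first consumes everything up to the end of the string, then backtracks
# one character at a time until the lookahead (?=&) succeeds. The first
# position where that works is the LAST & in the string.
print(greedy.group())    # Arguments%20for%20&%20against%20protection%20.pdf
print(s[greedy.end():])  # &code2=pds
```

The remainder starts with the final &, which shows that the lookahead was satisfied at the last possible position rather than the first.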

A common fix for "I want as few as possible" is to use a non-greedy (lazy) quantifier, .+?, which means match as few as possible, but it could still end up matching things you don't want.
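To make "things you don't want" concrete, here is a minimal sketch with a hypothetical input that is not from the question: a query string whose realname value is empty. The lazy form .+? must still consume at least one character, so it swallows the first & and runs on to the next one, while the negated character class simply finds no match.

```python
import re

# Hypothetical input (an assumption for illustration): the realname value is
# empty, so the first & comes immediately after "realname=".
s2 = 'realname=&x=1&y=2'

lazy = re.search(r'(?<=realname=).+?(?=&)', s2)
negated = re.search(r'(?<=realname=)[^&]+(?=&)', s2)

print(lazy.group())  # &x=1  (the lazy match crossed an &)
print(negated)       # None  (no run of non-& characters exists here)
```

Whether returning nothing is the right outcome for an empty value depends on the application; the point is only that lazy matching does not guarantee the match stops before the first &.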

If you really mean "match the first possible &" then the expression you should repeat before it is "anything except &".

ptn=r'(?<=realname=)[^&]+(?=&)'

(Notice also the use of an r'...' string. It doesn't make any difference here, but it's another common newbie error -- you want backslashes in your regex and don't understand why Python is losing them.)
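A quick sketch of why the r'...' prefix matters once a pattern does contain backslashes. The \b word-boundary pattern and the sample sentence are my own illustration, not part of the question.

```python
import re

plain = '\bfor\b'   # Python turns each \b into a backspace character (0x08)
raw = r'\bfor\b'    # the backslashes survive and reach the regex engine

print(len(plain))  # 5
print(len(raw))    # 7

text = 'Arguments for protection'
print(re.search(plain, text))        # None: the text contains no backspaces
print(re.search(raw, text).group())  # for
```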

This is basically a restatement of the other answer on this page but hopefully easier for a beginner to digest.

tripleee

Use a negated character class instead of .+:

In [5]: ptn='(?<=realname=)[^&]+(?=&)'

In [6]: re.search(ptn,str).group()
Out[6]: 'Arguments%20for%20'

You can also use a non-greedy quantifier by appending ? to .+, but a negated character class gives better performance in this case:

In [7]: ptn='(?<=realname=).+?(?=&)'

In [9]: %timeit re.search(ptn,str).group()
1000000 loops, best of 3: 1.46 us per loop

In [10]: ptn='(?<=realname=)[^&]+(?=&)'

In [11]: %timeit re.search(ptn,str).group()
1000000 loops, best of 3: 1.18 us per loop
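Outside IPython, roughly the same comparison can be made with the standard timeit module. This is only a sketch: the variable names and the iteration count are my own choices, and the absolute numbers will differ by machine and Python version, so no timings are shown.

```python
import re
import timeit

s = 'filename=1817616353&realname=Arguments%20for%20&%20against%20protection%20.pdf&code2=pds'

patterns = {
    'lazy': re.compile(r'(?<=realname=).+?(?=&)'),
    'negated': re.compile(r'(?<=realname=)[^&]+(?=&)'),
}

# Sanity check: both patterns extract the same value from this input.
assert patterns['lazy'].search(s).group() == patterns['negated'].search(s).group()

timings = {}
for name, pattern in patterns.items():
    # Total seconds for 100,000 searches with each pattern.
    timings[name] = timeit.timeit(lambda: pattern.search(s), number=100_000)
    print(f'{name}: {timings[name]:.4f} s')
```

A difference this small is easily dominated by noise, so it is worth repeating the measurement a few times before drawing conclusions.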

For more on the difference between non-greedy quantifiers and negated character classes, see the post "Which would be better non-greedy regex or negated character class?"

Mazdak
  • @downvoter, Please use your downvoting privilege whenever you encounter [an answer that is clearly and perhaps dangerously incorrect](https://stackoverflow.com/help/privileges/vote-down). – Mazdak Nov 14 '17 at 12:53
  • I'm not a downvoter (rather the opposite) but perhaps mention that the OP has a fundamental misunderstanding of how regex works. The expectation that the first occurrence of `&` should match is exactly wrong, and typical of not understanding longest-leftmost matching + backtracking which are rather fundamental concepts. – tripleee Nov 15 '17 at 12:06
  • @tripleee Indeed! And that's why I suggested the negated character class in the first place rather than lazy matching (regardless of its performance): it not only gives the OP the solution but also raises questions in their mind that will lead to understanding those concepts. – Mazdak Nov 15 '17 at 12:16
  • Can I suggest an edit or would you prefer that I post a separate answer? – tripleee Nov 15 '17 at 12:18
  • @tripleee That's up to you. If you think your answer is something different, I'd strongly encourage you to post it. – Mazdak Nov 15 '17 at 12:19
  • I'll create a slightly different exposition but if you agree with the wording, feel free to copy/paste it here and I'll delete it. – tripleee Nov 15 '17 at 12:30
  • @tripleee No, your answer explains the underlying misunderstanding and is definitely worth being a separate answer. – Mazdak Nov 15 '17 at 12:39