
I have the following string:

line = '![[screenshotone.png]] and the next in the same line as ![[screenshottwo.jpg]]'

I want to fetch screenshotone.png and screenshottwo.jpg as two elements of a list from a regex search.

I am using:

output = re.findall(r'\[\[(.*)\]\]', line, re.I)

I expect output to be the list ['screenshotone.png', 'screenshottwo.jpg'], but instead I get ['screenshotone.png]] and the next in the same line as ![[screenshottwo.jpg'].

I don't understand what to change in the regex pattern so that each file name is matched separately, as expected.

ForceBru
nichas
  • I found a pattern change that works: `!\[\[([^\[]*)\]\]` splits the two images into separate matches. If there are cases this pattern would miss, please suggest improvements. Thanks! – nichas Apr 03 '23 at 05:54
  • Change `.*` to `.*?`, making it reluctant (not greedy), causing it to match as few characters as possible, so that it will not match the first `]]` (unlike `.*`, which, being greedy, matches as many characters as possible, including the first `]]`). – Cary Swoveland Apr 03 '23 at 06:00

2 Answers


As "Watch Out for The Greediness" warns, * is greedy by default. Make the regex "lazy" by inserting ? after .*:

>>> re.findall(r'\[\[(.*?)\]\]',line,re.I)
['screenshotone.png', 'screenshottwo.jpg']

The reason for your regex matching everything up to the last closing ]] is that * is greedy (from "Watch Out for The Greediness" above):

That is, the plus [and the * in your case] causes the regex engine to repeat the preceding token as often as possible. Only if that causes the entire regex to fail, will the regex engine backtrack. That is, it will go back to the plus [or the asterisk], make it give up the last iteration, and proceed with the remainder of the regex.
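To make the backtracking behaviour concrete, here is a small sketch that runs the question's string through three patterns: the original greedy one, the lazy one, and a negated character class. The negated class `[^\]]*` is an alternative added here for comparison and is not part of the original answer; it assumes file names never contain a `]`.

```python
import re

line = '![[screenshotone.png]] and the next in the same line as ![[screenshottwo.jpg]]'

# Greedy: .* first consumes the rest of the line, then backtracks
# only until the LAST ]] lets the overall pattern match.
greedy = re.findall(r'\[\[(.*)\]\]', line)

# Lazy: .*? expands one character at a time and stops at the FIRST ]].
lazy = re.findall(r'\[\[(.*?)\]\]', line)

# Negated class (assumption: names never contain ']'):
# [^\]]* cannot cross a closing bracket, so no backtracking past ]] is possible.
negated = re.findall(r'\[\[([^\]]*)\]\]', line)

print(greedy)   # ['screenshotone.png]] and the next in the same line as ![[screenshottwo.jpg']
print(lazy)     # ['screenshotone.png', 'screenshottwo.jpg']
print(negated)  # ['screenshotone.png', 'screenshottwo.jpg']
```

On this input the lazy and negated-class patterns give the same result; they would differ only if a name could legitimately contain a single `]`.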

ForceBru

This will probably solve it:

x = '![[screenshotone.png]] and the next in the same line as ![[screenshottwo.jpg]]'

pattern = r"\w+\.\w+"

re.findall(pattern, x)
output:
['screenshotone.png', 'screenshottwo.jpg']

'\w' -> matches a single word character: a letter, a digit, or an underscore

'+' -> greedily matches one or more repetitions of the expression to its left

'\.' -> matches a literal dot; the escape is required because an unescaped dot matches any character except a newline

So this is a pattern that should match the file names. It might require some changes based on all the different filenames in your data.
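As a rough illustration of the kind of changes that might be needed, the sketch below uses a made-up input (the `text` string is hypothetical, not from the question) in which the surrounding prose contains dotted tokens and a file name contains a hyphen and a space. It compares the `\w+\.\w+` pattern with one anchored on the `![[...]]` delimiters.

```python
import re

# Hypothetical input: a name with a hyphen and a space, plus dotted
# tokens in the surrounding prose.
text = 'See ![[my-shot v1.2.png]] (approx. 3.5 MB) and ![[screenshottwo.jpg]]'

# Delimiter-agnostic: picks up any word.word token, wherever it appears.
loose = re.findall(r'\w+\.\w+', text)

# Delimiter-anchored: captures whatever sits between ![[ and the next ]].
anchored = re.findall(r'!\[\[(.*?)\]\]', text)

print(loose)     # ['v1.2', '3.5', 'screenshottwo.jpg']
print(anchored)  # ['my-shot v1.2.png', 'screenshottwo.jpg']
```

If the file names are always plain `word.ext` tokens and nothing else in the text looks like one, the shorter pattern is enough; otherwise anchoring on the brackets is the safer choice.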

Mark as answered if this solves the problem you are facing

otaku