1

I'm trying to find a substring of a string s, starting with {{Infobox and ending with }}. I tried doing this with a regular expression, but it doesn't get any results. I think the fault is in my regular expression, but since I'm quite new to regex, I hope someone can help with this. String s is, for example:

s = '{{blabla}}{{Infobox persoon Tweede Wereldoorlog| naam=Albert Speer| afbeelding=Albert Speer Neurenberg.JPG}}{{blabla}}'

result = re.search('(.*)\{\{Infobox (.*)\}\}(.*)', s)
if result:
    print(result.group(2))
Anand S Kumar
maxmijn
  • Exactly what are you expecting as output? – Anand S Kumar Oct 13 '15 at 10:37
  • I suggest you have a play with e.g. https://regex101.com/r/rB2bM0/1, and note that you should use raw (`r''`) strings with regex to avoid issues with backslashes. – jonrsharpe Oct 13 '15 at 10:40
  • The string from 'persoon..' to '...JPG', so everything that's in the 'Infobox' – maxmijn Oct 13 '15 at 10:40
  • I tried `re.search('{{Infobox.*?}}', s).group()`, which gives me the result in Python 2.6. But calling `group()` only after checking that you got a match object would be the right approach. – Vineesh Oct 13 '15 at 11:56

3 Answers

4

You can use lazy dot matching since your delimiters are not one-symbol delimiters, and capture what you need into group 1:

import re
p = re.compile(r'\{\{Infobox\s*(.*?)}}')
test_str = "{{blabla}}{{Infobox persoon Tweede Wereldoorlog| naam=Albert Speer| afbeelding=Albert Speer Neurenberg.JPG}}{{blabla}}"
match = p.search(test_str)
if match:
    print(match.group(1))

If you use a negated character class instead, any single { or } inside the Infobox will prevent the whole substring from matching.

Also, since you do not seem to need the substrings before and after the substring you need, you do not need to match (or capture) them at all (thus, I removed them).
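
To make the difference between greedy and lazy dot matching concrete, here is a minimal sketch that runs both variants against the sample string from the question. The variable names `greedy` and `lazy` are only illustrative, and the greedy pattern is included purely for contrast.

```python
import re

s = ('{{blabla}}{{Infobox persoon Tweede Wereldoorlog| naam=Albert Speer| '
     'afbeelding=Albert Speer Neurenberg.JPG}}{{blabla}}')

# Greedy: (.*) first runs to the end of the string, then backtracks
# until "}}" can match, so it ends at the LAST "}}" in the string.
greedy = re.search(r'\{\{Infobox\s*(.*)}}', s)

# Lazy: (.*?) grows one character at a time and stops at the FIRST "}}".
lazy = re.search(r'\{\{Infobox\s*(.*?)}}', s)

print(greedy.group(1))  # runs past the Infobox into the trailing {{blabla
print(lazy.group(1))    # only the Infobox contents
```

The greedy version swallows the closing braces of the Infobox and part of the following template, which is why the lazy quantifier is the one used above.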

Wiktor Stribiżew
  • Mind that in case you have a newline in your Infobox, you will need to use `re.S`/`re.DOTALL` modifier: [`p = re.compile(r'\{\{Infobox\s*(.*?)}}', re.S)`](https://ideone.com/cpm54O). – Wiktor Stribiżew Oct 14 '15 at 06:30
  • In practice this won't work terribly well since the infobox can (and often does) contain other templates. Parsing tree structures with regex is generally a bad idea ([the `{{center}}` cannot hold](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) etc). Use something like [mwparserfromhell](http://mwparserfromhell.readthedocs.org/en/latest/usage.html) instead. – Tgr Oct 16 '15 at 06:58
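
The two comments above can be illustrated with a short sketch. The sample strings `multiline` and `nested` are made up for the demonstration, and `extract_template` is a hypothetical helper that only counts `{{` and `}}` pairs using the standard library. It is not a wikitext parser and is no substitute for mwparserfromhell.

```python
import re

PATTERN = r'\{\{Infobox\s*(.*?)}}'

# 1) Newline inside the infobox: "." does not match "\n" unless re.S is set.
multiline = '{{Infobox persoon\n| naam=Albert Speer\n}}'
print(re.search(PATTERN, multiline))                 # no match without re.S
print(re.search(PATTERN, multiline, re.S).group(1))  # matches across lines

# 2) Nested template: the lazy match stops at the inner "}}".
nested = '{{Infobox persoon| naam={{center|Albert Speer}}| land=Duitsland}}'
print(re.search(PATTERN, nested, re.S).group(1))     # truncated result


def extract_template(text, name):
    """Hypothetical helper: body of the first {{name ...}}, tracking brace depth."""
    start = text.find('{{' + name)
    if start == -1:
        return None
    depth = 0
    i = start
    while i < len(text) - 1:
        pair = text[i:i + 2]
        if pair == '{{':
            depth += 1
            i += 2
        elif pair == '}}':
            depth -= 1
            i += 2
            if depth == 0:
                # Drop the opening "{{name" and the closing "}}".
                return text[start + 2 + len(name):i - 2].strip()
        else:
            i += 1
    return None  # unbalanced braces


print(extract_template(nested, 'Infobox'))
```

The depth counter returns the full Infobox body including the inner template, which a single lazy regex cannot do. Real wikitext has more cases (triple braces, `<nowiki>`, comments), so a dedicated parser remains the safer choice.
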
2

Code:

import re
s = '{{blabla}}{{Infobox persoon Tweede Wereldoorlog| naam=Albert Speer| afbeelding=Albert Speer Neurenberg.JPG}}{{blabla}}'

result = re.search(r'(.*){{Infobox ([^}]*?)}}(.*)', s)
if result:
    print(result.group(2))

Output:

persoon Tweede Wereldoorlog| naam=Albert Speer| afbeelding=Albert Speer Neurenberg.JPG

NOTE: The above regex will match till it meets the first } after {{Infobox.

Important note:

This will only work for the cases like the given sample input

It will not work if the input has a } in between, i.e. for an input such as

{{blabla}}{{Infobox persoon Tweede Wereldoorlog| naam=Albert Speer| }afbeelding=Albert Speer Neurenberg.JPG}}{{blabla}}

For cases like that, stribizhev's answer is the best solution.

The6thSense
  • @maxmijn happy to help – The6thSense Oct 13 '15 at 11:02
  • The **the above regex will match till it meets }}** statement is wrong. If you think `[^}}]` matches 2 characters other than `}` you are mistaken. It only matches **one** non-`}`. So, if the path contains `}`, this regex will fail. Actually, there are 2 ways here: 1) a tempered greedy token, 2) lazy dot matching. The latter is more efficient, and my suggestion is based on that. – Wiktor Stribiżew Oct 13 '15 at 11:10
  • @stribizhev agreed removing my answer – The6thSense Oct 13 '15 at 11:16
  • No why removing? You can provide an alternative with a tempered greedy token that will be just a bit inefficient. – Wiktor Stribiżew Oct 13 '15 at 11:16
  • @stribizhev I don't want to pollute SO with a wrong answer :) and I don't know enough about regex to rewrite it – The6thSense Oct 13 '15 at 11:17
  • I edited your answer so that the description is in line with the code (and trimmed the regex a bit). – Wiktor Stribiżew Oct 13 '15 at 11:21
  • @stribizhev even then it will not match this case `s = '{{blabla}}{{Infobox persoon Tweede Wereldoorlog| naam=Albert Speer| }afbeelding=Albert Speer Neurenberg.JPG}}{{blabla}}'` with a `}` in between, whereas your answer does match that case – The6thSense Oct 13 '15 at 11:22
  • @VigneshKalai: Correct. That is what I was driving at from the beginning. – Wiktor Stribiżew Oct 13 '15 at 11:24
  • @stribizhev agreed :) so I will just state that this will only work for this case and will not work for cases with `}` in between and by the way I can't compete with regex masters – The6thSense Oct 13 '15 at 11:26
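
The comment thread above contrasts three approaches; the following sketch runs them side by side on the input with a stray `}`. The tempered greedy token is written here as `(?:(?!}}).)*`, which is one common way to spell it and is an assumption, since the thread does not show the exact pattern.

```python
import re

s = ('{{blabla}}{{Infobox persoon Tweede Wereldoorlog| naam=Albert Speer| '
     '}afbeelding=Albert Speer Neurenberg.JPG}}{{blabla}}')

# Negated character class: stops at the first single "}", which is not
# followed by another "}", so the overall match fails.
negated = re.search(r'\{\{Infobox ([^}]*)}}', s)

# Tempered greedy token: consume any character as long as "}}" does not
# start at the current position.
tempered = re.search(r'\{\{Infobox ((?:(?!}}).)*)}}', s)

# Lazy dot matching: expand until the first "}}" is found.
lazy = re.search(r'\{\{Infobox (.*?)}}', s)

print(negated)            # None
print(tempered.group(1))
print(lazy.group(1))
```

Both the tempered token and the lazy dot return the same text here; the lazy version simply avoids running a lookahead at every position.
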
0
import re

s = '{{blabla}}{{Infobox persoon Tweede Wereldoorlog| naam=Albert Speer| afbeelding=Albert Speer Neurenberg.JPG}}{{blabla}}'

# start with Infobox and two chars before, grab everything but '}', followed by two chars
mo = re.search(r'(..Infobox[^}]*..)', s)

# guard against a failed search, which returns None
if mo:
    print(mo.group(1))

# {{Infobox persoon Tweede Wereldoorlog| naam=Albert Speer| afbeelding=Albert Speer Neurenberg.JPG}}
LetzerWille