0

I know that in a regex we can use ^ inside a character class to negate it. For example, [^ ]*? matches a string with no spaces. How can we use this to exclude a sequence of two or more consecutive characters? For example, a string that doesn't contain {{ but may contain a single {. I tried these and they didn't work:

re.compile(r"(\{\{`[^(\{\{)]*?\}\}`)
re.compile(r"(\{\{`[^\{\{]*?\}\}`)

This is to catch strings in a file that start with {{ and end with }} but don't contain }} in between, although they can contain a single }. Also, using .* is not an option.

import re

input_string = "blah blah blah {{cite journal |last=Malatesta|first=Errico|title=Towards Anarchism|journal=MAN!|publisher=International Group of San Francisco|location=Los Angeles|oclc=3930443|url=http://www.marxists.org/archive/malatesta/1930s/xx/toanarchy.htm|archiveurl=http://web.archive.org/web/20121107221404/http://marxists.org/archive/malatesta/1930s/xx/toanarchy.htm|archivedate=7 November 2012 |deadurl=no|authorlink=Errico Malatesta |ref=harv}} blah blah blah"
regexp_1 = re.compile(r"(\{\{[^\}]*?\}\})")
output = regexp_1.sub("", input_string)

Now, in regexp_1, I want to replace [^\}]*? with something like [^\}\}]*?, but I know that [^\}\}]*? is not correct, since it works the same way as [^\}]*?.

Nick
  • Well, to start, what you have shown in your code here won't compile; it's missing parentheses and quotation marks. Could you also provide some example text that you would like to parse, please? – themantalope Dec 30 '15 at 20:12
  • As far as I know, you can't use something like `[^word]`, since this only matches any single character other than `w`, `o`, `r`, `d`. I also know you can use negative lookaheads like `myword(?!something)` to match `myword` only if it is not followed by `something`. However, I know there are some tricks to match anything except a word – Federico Piazza Dec 30 '15 at 20:29
  • can you post some sample data for what you want and what you don't? – Federico Piazza Dec 30 '15 at 20:44
  • I added that @FedericoPiazza – Nick Dec 30 '15 at 21:42
  • Nick, I wrote you a working answer. I'll post it after you go through and accept a few answers to your many previous questions. – Autumn Dec 30 '15 at 22:08
  • @DSJustice If you can't see an accepted answer on those questions, it's because I could not find my answer there, although I upvoted their efforts. Thanks for your time, but that's not the right way. – Nick Dec 30 '15 at 22:14

4 Answers

1

This is to catch strings in a file that starts with {{ and ends with }} but doesn't contains }} while they can contain a single }

import re

your_string = "{{first group}} {{second {} group}}"
pattern = re.compile(r'{{.*?}}')
pattern.findall(your_string)  # returns list of matches

Which will return

['{{first group}}', '{{second {} group}}']
Martin Konecny
  • it is not exactly the use case. Plz have a look at my example. – Nick Dec 30 '15 at 21:54
  • Using `.*` is out of the equation. I know I can use `.*` but I want to avoid it. – Nick Dec 30 '15 at 22:00
  • It's actually `.*?`, which is the non-greedy (inexpensive) form. Unless you have an assignment asking you to do it differently, it's the most straightforward and efficient solution. – Martin Konecny Dec 30 '15 at 22:02
  • Look at the differences between the steps here: https://regex101.com/r/qJ3uK7/1 and also here https://regex101.com/r/qJ3uK7/2 that is one of the reasons I want to avoid `.*` – Nick Dec 30 '15 at 22:07
  • hmmm, that doesn't make any sense as the work done between `{{.*?}}` and `{{[^}]*?}}` (you don't need to escape `}`) should be almost exactly the same. One allows any character until `}}`. The other allows any character *except* for `}` until `}}`. – Martin Konecny Dec 30 '15 at 22:16
  • That's why. If the profiling is right, then it takes many more steps, and I'm trying to decrease the complexity of my code since I need to run it over millions of long lines. – Nick Dec 30 '15 at 22:20
  • I'd suggest you profile the actual implementation in Python instead of using an online tool that most likely won't reflect your real-world results. See http://stackoverflow.com/questions/1593019/is-there-any-simple-way-to-benchmark-python-script . If you use `findall` there shouldn't be any difference. Best of luck! – Martin Konecny Dec 30 '15 at 22:22
  • I tried profiling it differently, and by changing every `.*` to `[^X]`, where `X` is a character I know will not occur in the sequence, I made it around 20x faster. It's just interesting why it behaves like that on long lines. – Nick Dec 30 '15 at 22:39
  • I ended up profiling the two solutions, and it was 6 seconds for 100 runs of `.*?` vs 4 seconds for 100 runs of `[^}]*?`. Checking two characters at a time is the costly part. Why? Because regex consumes one character at a time. If for each character you are looking one character ahead, you are doubling your work. You can try `{{([^}]|}[^}])*?}}` (allow all non-`}` and all single `}`), and the time jumps to 14 seconds for 100 iterations. – Martin Konecny Dec 31 '15 at 01:26
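A rough way to reproduce that kind of comparison in Python is sketched below. The sample line is made up for illustration, the last pattern uses a non-capturing group so that `findall` returns whole matches, and the timings will differ from the figures quoted in the comments. Note that the plain negated class cannot match a template that contains a single `}`.

```python
import re
import timeit

# Hypothetical sample: one long line with a template that contains a single "}".
template = "{{cite journal |last=Malatesta |note=a } b}}"
sample = "blah " * 2000 + template + " blah" * 2000

patterns = {
    "lazy dot": re.compile(r"\{\{.*?\}\}"),
    "negated class": re.compile(r"\{\{[^}]*?\}\}"),
    "single } allowed": re.compile(r"\{\{(?:[^}]|\}[^}])*?\}\}"),
}

for name, pattern in patterns.items():
    # Time 100 full scans of the sample line with each pattern.
    seconds = timeit.timeit(lambda: pattern.findall(sample), number=100)
    print(f"{name}: {seconds:.3f}s, matches found: {len(pattern.findall(sample))}")
```

Profiling on real input lines is still the more reliable guide, since the relative cost depends heavily on line length and on how often `{{` occurs.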
1

It looks like what you actually want is to match the first }} after {{. The simplest regexp that does this is:

\{\{.*?\}\}

Make sure to configure . to match line breaks if you allow them to be inside.
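In Python's `re` module, for instance, that setting is the `re.DOTALL` flag. A minimal sketch, with a made-up sample string:

```python
import re

text = "before {{spans\ntwo lines}} after"  # made-up sample

# By default "." does not match "\n", so the template is missed.
default_matches = re.findall(r"\{\{.*?\}\}", text)
print(default_matches)  # []

# re.DOTALL lets "." match line breaks as well.
dotall_matches = re.findall(r"\{\{.*?\}\}", text, re.DOTALL)
print(dotall_matches)  # ['{{spans\ntwo lines}}']
```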

If you are concerned about performance, I would say that this regexp is one of the fastest. Alternatives would be:

1) Use negative lookahead

\{\{((?!\}\}).)*\}\}

This has comparable performance, since a lookahead check is performed for every character.

2) Use atomic group and possessive quantifier

\{\{(?>[^}]|\}[^}])*+\}\}

This one might actually be faster: thanks to the "?>" and "*+" constructs, it won't give up already matched characters, so it does everything in a single pass. P.S.: make sure your regexp engine supports these constructs.
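For Python specifically, here is a hedged sketch of the two alternatives. As far as I know, atomic groups and possessive quantifiers are only accepted by the standard `re` module from Python 3.11 on, so the second pattern below is a swapped-in variant that gets a similar effect with a lookahead placed only after a `}`. The sample text and the names `tempered` and `unrolled` are made up for illustration.

```python
import re

text = "x {{first}} y {{second {} group}} z"  # made-up sample

# 1) Negative lookahead before every character.
tempered = re.compile(r"\{\{(?:(?!\}\}).)*\}\}")

# 2) Consume runs of non-"}" with [^}]*, and allow a "}" only when it is
#    not followed by another "}". Avoids atomic groups entirely.
unrolled = re.compile(r"\{\{[^}]*(?:\}(?!\})[^}]*)*\}\}")

tempered_matches = tempered.findall(text)
unrolled_matches = unrolled.findall(text)
stripped = unrolled.sub("", text)

print(tempered_matches)  # ['{{first}}', '{{second {} group}}']
print(unrolled_matches)  # ['{{first}}', '{{second {} group}}']
print(repr(stripped))    # 'x  y  z'
```

Neither pattern uses `.*`, and both stop at the first `}}` while still allowing a single `}` inside.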

Dmitry
  • It doesn't still work. Plz look at my example in the question. – Nick Dec 30 '15 at 21:52
  • @Nick I updated answer to match what you meant. P.S.: your question title doesn't really reflect what you meant. – Dmitry Dec 30 '15 at 22:25
  • Thanks. As I mentioned, I want to avoid `.*` since it adds a lot to the complexity of my regex. – Nick Dec 30 '15 at 22:36
  • @Nick can you describe what the exact problem with ".*?" is? I find it the most appropriate solution. Regarding visual complexity, the other solutions are even worse. – Dmitry Dec 30 '15 at 22:44
  • In profiling and testing on my cases, `.*` adds more steps and is much, much slower. I don't know why. My original code was based on `.*`; that's why I asked this question. – Nick Dec 30 '15 at 22:52
  • @Nick So the problem is only in performance. Then you can try 2-nd option. – Dmitry Dec 30 '15 at 23:20
0

For that case you can use a negative look ahead:

^((?!}}).)*$

And to catch the string between {{ and }}, you can use re.search() with the aforementioned regex.

>>> import re
>>> s = 'this {{ is {a} sample }}text'
>>> re.search(r'{{(((?!}}).)*)}}',s).group(1)
' is {a} sample '
Mazdak
0

As far as I know, you can't use something like [^word], since this only matches any single character other than w, o, r, d.

I also know you can use negative lookaheads like myword(?!something) to match myword only if it is not followed by something.

However, to match something that is not a word, I know you have to use some tricks like the ones described in this post: Match everything except for specified strings

For your specific case, you can use this regex to check that the line does not contain {{:

^(?!.*\{\{)

Regex Demo
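As a small sketch of how that check could be used in Python (the list of lines and the name `no_double_brace` are made up for illustration):

```python
import re

# Succeeds at the start of a line only if "{{" appears nowhere after it.
no_double_brace = re.compile(r"^(?!.*\{\{)")

lines = ["plain text", "a single { is fine", "has {{template}} inside"]
kept = [line for line in lines if no_double_brace.match(line)]
print(kept)  # ['plain text', 'a single { is fine']
```

Note that the pattern itself matches an empty string, so it only tells you whether a line passes the check; it does not extract anything.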

On the other hand, if you use PCRE regex then you can use the discard verbs, so if you want to skip patterns like {{something}}, you can use this:

\{\{\w+\}\}(*SKIP)(*FAIL)|(\w+)
           ^^^^^^^^^^^^^^ if your pattern matches, it will be discarded intentionally 

Working demo
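As far as I know, Python's built-in `re` module does not support these PCRE verbs. A common substitute, sketched here as an assumption rather than an equivalent, is to let the unwanted pattern match in a first alternative without capturing it, and then keep only the matches where the capture group is filled. The sample text is made up.

```python
import re

text = "foo {{bar}} baz"  # made-up sample

# Branch 1 consumes {{word}} without capturing; branch 2 captures other words.
pattern = re.compile(r"\{\{\w+\}\}|(\w+)")

# findall returns group 1 for every match, which is "" when branch 1 matched.
words = [w for w in pattern.findall(text) if w]
print(words)  # ['foo', 'baz']
```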

Federico Piazza
  • I know I can't use `[^word]`. But if I want to, let's say, use `word` with negation, how can that work? I don't want to use `.*` since the string is long and it adds steps. – Nick Dec 30 '15 at 21:30
  • @Nick there is no built-in feature for skipping a word, unless you use the discard verbs. I've updated my answer with that – Federico Piazza Dec 31 '15 at 12:35