
Every topic I've read combining Python's Regex (re library) and Inverse/Negative matching has focused on multiline strings as opposed to SINGLE line strings.

I know that http://www.regextester.com/15 uses a JavaScript regex engine with the global flag (/g), which displays every match and behaves differently from Python's re library. (According to https://rexegg.com/ there is another regex library for Python, which I don't wish to use just yet.) I want to know whether there is a way to use re.findall (or re.search, although I'm partial to re.findall) to do two things:

1. Return all individual strings that do not contain the string "hede" in qw below.
2. Return all individual strings that do not contain the string "hede", and break strings that do contain "hede" on either side of it.

>>> qw = "hoho hihi haha hede rara a rere titi so whdhdskhdshede wekjewhkwqjhededjfjfj so kjkfdjkdnekjdhide b hede kdjkdld"

Scenario 1 Desired Output (exclude all strings that contain "hede"):

>>> qw ='hoho hihi haha hede rara a rere titi so whdhdskhdshede wekjewhkwqjhededjfjfj so kjkfdjkdnekjdhide b hede kdjkdld'
>>> re.findall('{SOMETHING_THAT_EXCLUDES_ALL_STRINGS_CONTAINING_hede}', qw)
['hoho', 'hihi', 'haha', 'rara', 'a', 'rere', 'titi', 'so', 'so', 'kjkfdjkdnekjdhide', 'b', 'kdjkdld']

Scenario 2 Desired Output (include everything that doesn't contain "hede" and break strings containing "hede" at "hede"):

>>> qw ='hoho hihi haha hede rara a rere titi so whdhdskhdshede wekjewhkwqjhededjfjfj so kjkfdjkdnekjdhide b hede kdjkdld'
>>> re.findall('{SOMETHING_THAT_INCLUDES_ALL_STRINGS_NOT_CONTAINING_hede_AND_BREAKS_THEM_IF_THEY_DO}', qw)
['hoho', 'hihi', 'haha', 'rara', 'a', 'rere', 'titi', 'so', 'whdhdskhds', 'wekjewhkwqj', 'djfjfj', 'so', 'kjkfdjkdnekjdhide', 'b', 'kdjkdld']

The closest I've come is inefficient and still misses most of the words:

>>> qw ='hoho hihi haha hede rara a rere titi so whdhdskhdshede wekjewhkwqjhededjfjfj so kjkfdjkdnekjdhide b hede kdjkdld'
>>> re.findall('[\S]+(?=hede)|(?<=hede )[\S]+|(?<=hede)[\S]+|[\S]+(?= hede)|[\S]+(?=hede )|(?<= hede)[\S]+', qw)
['haha', 'rara', 'whdhdskhds', 'wekjewhkwqj', 'djfjfj', 'b', 'kdjkdld']

Keep in mind that qw has a single space between the terms. I couldn't help but wonder whether a solution would still be possible if the spacing varied, i.e. if qw had been the following:

>>> qw = "hoho hihi   haha    hede rara     a rere titi so   whdhdskhdshede wekjewhkwqjhededjfjfj  so kjkfdjkdnekjdhide        b     hede   kdjkdld"


Thank you guys for all of the help.

Also, in every thread I've read a variation on "^(?!hede).*$" or "^(?!.foo)." has come up for multiline posts. This doesn't work well in Python of course, but I've tried fooling around with these to no avail.

Thank you guys so much for the help!

FailSafe
  • Python `re` does not support skipping matched texts. However, you may match and capture what you need, and just match what you do not need. See http://ideone.com/jh3vIN. Just `re.findall(r'hede|((?:(?!hede)\S)+)', qw)` will work as you need, right? Well, you will have empty elements. With PyPi regex module, you may get cleaner output with `regex.findall(r'hede(*SKIP)(*F)|((?:(?!hede)\S)+)', qw)` – Wiktor Stribiżew May 23 '17 at 17:04
  • Woah. I will admit that the way I am interpreting that regex must be incorrect. This is amazing, really, but I don't understand it. At a quick glance, I would interpret "|hede" or "hede|" in an OR conditional as "match/return this group". What is changing its behaviour in this context? Thanks for the prompt response, by the way. Would you mind also providing the source where you found how to make regex behave this way and return what it has? – FailSafe May 23 '17 at 17:14
  • Do you want to say you think you can use `findall(r'hede|((?:(?!hede)\S)+)', qw)`? Does it work as expected? You will need to remove empty entries with `filter(None, results)`. – Wiktor Stribiżew May 23 '17 at 17:24
  • It worked very, very well. I'm just surprised because I don't understand why. I was just wondering why "|hede" or "hede|" because they contain an OR conditional are not returning "hede". Is it because lookarounds are involved? If so, I'm learning a ton about regex ordering -- more in the last 2 days than in months of reading. Thanks so much for the help. – FailSafe May 23 '17 at 17:30
  • I posted an answer, please check it and let me know if it is clear enough. – Wiktor Stribiżew May 23 '17 at 18:45

1 Answer


I suggest leveraging the fact that re.findall returns only the captured text when the pattern contains a capturing group:

If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.

So, you can match and capture what you need and just match what you need to skip. See the Python demo:

import re
qw ='hoho hihi haha hede rara a rere titi so whdhdskhdshede wekjewhkwqjhededjfjfj so kjkfdjkdnekjdhide b hede kdjkdld'
rx = r'hede|((?:(?!hede)\S)+)'
results = re.findall(rx, qw)
print(list(filter(None, results)))  # filter() is lazy in Python 3, so wrap it in list()
# => ['hoho', 'hihi', 'haha', 'rara', 'a', 'rere', 'titi', 'so', 'whdhdskhds', 'wekjewhkwqj', 'djfjfj', 'so', 'kjkfdjkdnekjdhide', 'b', 'kdjkdld']


Since hede is matched but not captured, it is not returned. However, because the pattern contains one capturing group, re.findall adds an empty string to the resulting list every time the match comes from the hede alternative and Group 1 does not participate.
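
To see where the empty strings come from, here is a minimal sketch on a shortened sample string. The names sample, raw, and cleaned are illustrative and not part of the original post.

```python
import re

# Simplest case: the first alternative matches without capturing,
# so findall reports an empty string for Group 1.
print(re.findall(r'aaa|(bbb)', 'aaabbbcccddd'))

# Shortened, hypothetical sample of the question's input.
sample = 'haha hede rara whdhdskhdshede b'
rx = r'hede|((?:(?!hede)\S)+)'

raw = re.findall(rx, sample)       # one entry per match, '' wherever "hede" matched
cleaned = [s for s in raw if s]    # drop the empty placeholders

print(raw)
print(cleaned)
```

The list comprehension does the same job as filter(None, results) and returns a list on both Python 2 and Python 3.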

Pattern details

  • hede - match hede
  • | - or
  • ((?:(?!hede)\S)+) - match and capture into Group 1 one or more non-whitespace chars that are not the starting point for a hede sequence.
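
The following sketch, run on a single token from the question's input, may help show why both parts of the pattern are needed. The variable names word, greedy, guarded_only, and full are illustrative assumptions.

```python
import re

word = 'wekjewhkwqjhededjfjfj'   # one token from the question's input

# Plain \S+ takes the whole token, "hede" included.
greedy = re.findall(r'\S+', word)

# The guarded token alone stops before "hede", but the scan then resumes
# one character later and picks up the trailing "ede" with the rest.
guarded_only = re.findall(r'(?:(?!hede)\S)+', word)

# With the leading "hede" alternative, "hede" is consumed (yielding ''),
# so the next match starts cleanly after it.
full = re.findall(r'hede|((?:(?!hede)\S)+)', word)

print(greedy)
print(guarded_only)
print(full)
```

In other words, the lookahead decides where a captured chunk must stop, while the bare hede alternative uses up the unwanted text so that none of its characters leak into the next chunk.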

Note that if you use the PyPI regex module, you may use the PCRE-like verbs (*SKIP)(*F):

>>> import regex
>>> qw ='hoho hihi haha hede rara a rere titi so whdhdskhdshede wekjewhkwqjhededjfjfj so kjkfdjkdnekjdhide b hede kdjkdld'
>>> print(regex.findall(r'hede(*SKIP)(*F)|((?:(?!hede)\S)+)', qw))
['hoho', 'hihi', 'haha', 'rara', 'a', 'rere', 'titi', 'so', 'whdhdskhds', 'wekjewhkwqj', 'djfjfj', 'so', 'kjkfdjkdnekjdhide', 'b', 'kdjkdld']

Then, there is no need to filter the results.
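
The patterns above address Scenario 2. For Scenario 1 and for the variable-spacing input from the question, the following is only a sketch and not part of the original answer: the pattern (?<!\S)(?!\S*hede)\S+ and the names spaced, scenario1, scenario1_plain, and scenario2_spaced are assumptions introduced here for illustration.

```python
import re

# Variable-spacing input from the question.
spaced = ("hoho hihi   haha    hede rara     a rere titi so   whdhdskhdshede "
          "wekjewhkwqjhededjfjfj  so kjkfdjkdnekjdhide        b     hede   kdjkdld")

# Scenario 1: keep only whole words that do not contain "hede".
# (?<!\S) anchors at the start of a word; (?!\S*hede) rejects the word
# if "hede" occurs anywhere before the next whitespace.
scenario1 = re.findall(r'(?<!\S)(?!\S*hede)\S+', spaced)

# The same result without a regex, for comparison.
scenario1_plain = [w for w in spaced.split() if 'hede' not in w]

# Scenario 2: the answer's pattern only ever matches non-whitespace,
# so runs of spaces need no special handling.
scenario2_spaced = [s for s in re.findall(r'hede|((?:(?!hede)\S)+)', spaced) if s]

print(scenario1)
print(scenario1 == scenario1_plain)
print(scenario2_spaced)
```

If a regex is not strictly required for Scenario 1, the split-and-filter version is probably the easier one to read and maintain.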

Wiktor Stribiżew
    My gosh!! Thanks, I understand it perfectly now. To be honest, I'm really upset that I didn't think of this. There was a time when I did something akin to this: `ws = 'aaabbbcccddd'` followed by `re.findall('aaa|(bbb)', ws)`, which gave `['', 'bbb']`, and it totally slipped my mind that if I put a capture group in parentheses, it would override/ignore anything matched outside the parentheses. Wiktor Stribiżew, dude, thank you. This really re-activated the gears in my head. I will also start to play around with PyPI's regex module moving forward. – FailSafe May 23 '17 at 19:02