I have this (simplified) regex:

((\s(python|java)\s)?((\S+\s+and\s))?(\S+\s+(love|hate)))

I built it in RegExr and tested it on this sentence:

python and java love python love python and java java

Which matches:

python and java love python love python and java java

This is exactly what I wanted, so I implemented it in Python:

import re
regex = re.compile("((\s(python|java)\s)?((\S+\s+and\s))?(\S+\s+(love|hate)))")
string = "python and java love python love python and java java"
print(str(re.findall(regex,string)))

However, this gives:

[('python and java love', '', '', 'python and ', 'python and ', 'java love', 'love'), ('python love', '', '', '', '', 'python love', 'love')]


What causes this difference and how can this be fixed?


Update 1
Using raw strings will not work either:

import re
regex = re.compile(r'((\s(python|java)\s)?((\S+\s+and\s))?(\S+\s+(love|hate)))')
string = "python and java love python love python and java java"
print(str(re.findall(regex,string)))

This will still give:

[('python and java love', '', '', 'python and ', 'python and ', 'java love', 'love'), ('python love', '', '', '', '', 'python love', 'love')]

Update 2
I will use my other regex (with different terms), because then I can say exactly what I want to match and what I don't:

"(?:\s(?:low|high)\s)?(?:\S+\s+and\s)?(\S+\s+stress|deficiency|limiting)"

What it should match:

low|high ANY_WORD stress|deficiency|limiting
ANY_WORD stress|deficiency|limiting
ANY_WORD and ANY_WORD stress|deficiency|limiting
ANY_WORD and ANY_WORD ANY_WORD stress|deficiency|limiting
(for the last one, only allow two words after "and" if stress, deficiency or limiting follows them)

What it shouldn't match:

stress|deficiency|limiting (so don't match these if nothing is in front of them)
    low
    high
    ANY_WORD
    ANY_WORD and ANY_WORD

Example lists:

match:

salt and water stress
photo-oxidative stress
salinity and high light stress
low-temperature stress
Cd stress
Cu deficiency
N deficiency
IMI stress

no match:

stress
deficiency
limiting
temperature and water
low
high
water and salt
CodeNoob

1 Answer

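Your regex has many unnecessary capturing groups, and they change what findall returns: when a pattern contains capturing groups, findall returns the captured groups for each match instead of the whole matched text.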

You can convert your regex to this and make it work:

>>> regex = re.compile(r"(?:\s(?:low|high)\s)?(?:\S+\s+and\s)?\S+[ \t]+(?:stress|deficiency|limiting)")
>>> print(re.findall(regex, string))

By the way, this also works without a raw string, because Python leaves unrecognized escapes such as \s unchanged in ordinary string literals. Using r"..." for your regex is still recommended.
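To see the effect of the groups in isolation, here is a minimal sketch using the sentence from the question. The two shortened patterns and the variable names are illustrative assumptions, not the exact regexes above.

```python
import re

text = "python and java love python love python and java java"

# Capturing groups: findall returns one tuple per match, one entry per group.
# A group that did not take part in the match shows up as an empty string.
capturing = re.compile(r"(\S+\s+and\s)?(\S+\s+(love|hate))")

# Same structure with non-capturing groups (?:...): findall returns the
# whole matched text for each match.
non_capturing = re.compile(r"(?:\S+\s+and\s)?\S+\s+(?:love|hate)")

grouped = capturing.findall(text)
whole = non_capturing.findall(text)

# Alternative that keeps the capturing groups: ask each match object
# for group 0, which is always the full match.
via_finditer = [m.group(0) for m in capturing.finditer(text)]

print(grouped)       # [('python and ', 'java love', 'love'), ('', 'python love', 'love')]
print(whole)         # ['python and java love', 'python love']
print(via_finditer)  # ['python and java love', 'python love']
```

If the groups are needed elsewhere, iterating with finditer and reading m.group(0) gives the full matches without rewriting the pattern.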

RegEx Demo

anubhava
  • But why does this work and mine not? Can you please explain that – CodeNoob May 08 '17 at 14:26
  • It works for my example code but not exactly for what I wanted to achieve :(. Because now it will also match: re.findall(regex,"we will and love") --> "and love" but I want it to match only if the last subgroup is matched, so it should always contain love and hate – CodeNoob May 08 '17 at 20:54
  • My regex doesn't match "and love", at least not in python. @anubhava – CodeNoob May 08 '17 at 21:05
  • Sorry it does hahah that was not the intention @anubhava – CodeNoob May 08 '17 at 21:06
  • I updated the question with what I want. I understand that this will be a big problem (at least for me haha) to figure this out so it would be great if you can help me but if not I will just accept your answer and will remove the edit @anubhava – CodeNoob May 08 '17 at 21:34
  • if I try "temperature and stress" it will only match "and stress" however it should match the whole string (I didn't check the other) @anubhava – CodeNoob May 08 '17 at 22:02
  • You can check it with the example list I provided – CodeNoob May 08 '17 at 22:09
  • Your question still isn't clear, and you are bringing up these requirements one by one. You need to update the question further, as `temperature and stress` is NOT part of the example list and nowhere does it say whether a match should be the full line or just partial. Also don't forget the case of `salinity and high light stress`, which has 2 words before `stress`, not one. – anubhava May 08 '17 at 22:11
  • I'm sorry, I meant temperature and water stress, so my example list is correct. Let me clarify further: I search through articles from which I want to extract these groups of words, so they should be matched partially within a bigger string. I really appreciate your help (I have been working on the correct regex for two days now) @anubhava Please let me know if it is still not clear – CodeNoob May 08 '17 at 22:19
  • I need to be offline now but this regex is completely matching `temperature and water stress` – anubhava May 08 '17 at 22:25
  • No problem I really appreciate your help. I think it doesn't match: " salinity and high light stress " – CodeNoob May 08 '17 at 22:30
  • You need to check the linked demo first before writing each and every case in comments. I also wrote in a comment to focus on this test case; you probably skipped that part. The regex demo does match `salinity and high light stress`, but only partially, because there are TWO words between `and` and `stress` while your regex allows only ONE word there. If you change the input to `salinity and highlight stress`, it will match completely. – anubhava May 09 '17 at 05:12
  • I understand! Thank you very much – CodeNoob May 09 '17 at 06:53
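
The partial match discussed in the comments can be reproduced with the example lists from the question. The sketch below also tries a variant that permits one optional extra word after "and". That variant, along with the helper and variable names, is an assumption for illustration and not part of the answer; it has only been checked against the phrases listed here.

```python
import re

# Regex from the answer.
answer_regex = re.compile(
    r"(?:\s(?:low|high)\s)?(?:\S+\s+and\s)?\S+[ \t]+(?:stress|deficiency|limiting)"
)

# Hypothetical variant (assumption): one optional extra word after "and".
variant_regex = re.compile(
    r"(?:\s(?:low|high)\s)?(?:\S+\s+and\s+(?:\S+\s+)?)?"
    r"\S+[ \t]+(?:stress|deficiency|limiting)"
)

should_match = [
    "salt and water stress",
    "photo-oxidative stress",
    "salinity and high light stress",
    "low-temperature stress",
    "Cd stress",
    "Cu deficiency",
    "N deficiency",
    "IMI stress",
]
should_not_match = [
    "stress",
    "deficiency",
    "limiting",
    "temperature and water",
    "low",
    "high",
    "water and salt",
]


def matched_text(regex, phrase):
    """Return the text of the first match in phrase, or None if there is none."""
    m = regex.search(phrase)
    return m.group(0) if m else None


phrases = should_match + should_not_match
answer_results = {p: matched_text(answer_regex, p) for p in phrases}
variant_results = {p: matched_text(variant_regex, p) for p in phrases}

# repr() makes leading spaces in a partial match visible.
for p in phrases:
    print(f"{p!r}: answer={answer_results[p]!r} variant={variant_results[p]!r}")
```

With the answer's regex, the only listed phrase that is not matched in full is "salinity and high light stress", where the match starts at the space before "high". Real article text may contain cases these lists do not cover, so the variant should be tested on actual input before relying on it.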