I am trying to evaluate whether two named groups of words occur within 25 words of each other. I am having two issues, which are likely closely related:
- I am following this approach to check whether certain words are near each other (http://www.regular-expressions.info/near.html). The original counter (counter1) appears to work, but I then want to split my code into two parts to double-check it. When I do so, "counter3" double counts (i.e., it counts the word "purchased" when it should only count "sells"). This is almost exactly the same question and issue as "Counting presence of words within context (near other words)", except using Python rather than Perl.
import re
from collections import Counter
text = "CompanyA sells Androids and Robots. Androids are then purchased and resold by Company"
counter1 = Counter(re.findall(r'\b((?:Androids\W+(?:\w+\W+){0,25}?\W+sells|purchased|resold)|sells|purchased|resold\W+(?:\w+\W+){0,25}?Androids)\b',text, re.DOTALL))
#To check that my code is working correctly, I then want to split counter1 into two parts. However, counter3 is giving me a double-counting issue:
counter2 = Counter(re.findall(r'\b((?:Androids\W+(?:\w+\W+){0,25}?\W+sells|purchased|resold))\b',text, re.DOTALL))
counter3 = Counter(re.findall(r'\b((?:sells|purchased|resold\W+(?:\w+\W+){0,25}?\W+Androids))\b',text, re.DOTALL))
#Result: counter1 = Counter({'sells': 1, 'purchased': 1, 'resold': 1})
#Result: counter2 = Counter({'purchased': 1, 'resold': 1})
#Result: counter3 = Counter({'sells': 1, 'purchased': 1})
#I have also tried the below variation, which corrects counter3, but then causes an issue with counter2
counter2 = Counter(re.findall(r'\b((Androids)\W+(?:\w+\W+){0,25}?(sells|purchased|resold))\b',text, re.DOTALL))
counter3 = Counter(re.findall(r'\b((sells|purchased|resold)\W+(?:\w+\W+){0,25}?(Androids))\b',text, re.DOTALL))
#Result: counter2 = Counter({('Androids and Robots. Androids are then purchased', 'Androids', 'purchased'): 1})
#Result: counter3 = Counter({('sells Androids', 'sells', 'Androids'): 1})
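The double counting in counter3 looks consistent with how "|" binds: alternation has the lowest precedence in a regex, so without a group around the keywords the pattern reads as "sells" OR "purchased" OR "resold ... Androids". A minimal sketch of that effect, using a shortened pattern with no word gap (the names ungrouped and grouped are illustrative only):

```python
import re

text = "CompanyA sells Androids and Robots. Androids are then purchased and resold by Company"

# No group around the keywords: "|" splits the whole expression, so
# "sells" and "purchased" match on their own, with no Androids required.
ungrouped = re.findall(r'\b(?:sells|purchased|resold\W+Androids)\b', text)

# Keywords wrapped in (?:...): the "\W+Androids" part now applies to every keyword.
grouped = re.findall(r'\b(?:sells|purchased|resold)\W+Androids\b', text)

print(ungrouped)  # ['sells', 'purchased']
print(grouped)    # ['sells Androids']
```

Wrapping the alternatives in their own group keeps the rest of the pattern attached to each keyword rather than only to the last one.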
- Next, I want to create variables for the groups of words and then reference them within my regular expression. I am following this reference: "How to use a variable inside a regular expression?". However, I am still having issues (maybe once the first question is answered, it will lead me to the answer for the second).
Group1 ='Androids'
Group2 = 'sells |purchased |resold '
counter2 = Counter(re.findall(rf'\b(?:{Group1}\W+(?:\w+\W+){0,25}?{Group2})\b',text, re.DOTALL))
counter3 = Counter(re.findall(rf'\b(?:{Group2}\W+(?:\w+\W+){0,25}?{Group1})\b',text, re.DOTALL))
#Result - counter2 = Counter({'': 2})
#Result - counter3 = Counter({'': 2})
#Interestingly, if I try an alternative variation (i.e., removing ?:), which fixed counter3 in my first question, it does not fix the issue when I reference the variables:
counter2 = Counter(re.findall(rf'\b({Group1}\W+(?:\w+\W+){0,25}?{Group2})\b',text, re.DOTALL))
counter3 = Counter(re.findall(rf'\b({Group2}\W+(?:\w+\W+){0,25}?{Group1})\b',text, re.DOTALL))
#Result - counter2 = Counter({('purchased ', ''): 1, ('resold ', ''): 1})
#Result counter3 = Counter({('sells ', ''): 1, ('purchased ', ''): 1})
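The empty strings in these results are consistent with the f-string itself: inside rf'...', {0,25} is treated as a replacement field and evaluated as the tuple (0, 25), so the quantifier never reaches the regex. Below is a minimal sketch of one possible way around it, assuming doubled braces for the quantifier and a group around each variable. The names wrong, gap, group1, group2, after_pat and before_pat are hypothetical, and the keyword alternatives are written without trailing spaces.

```python
import re
from collections import Counter

text = "CompanyA sells Androids and Robots. Androids are then purchased and resold by Company"

# Inside an f-string, {0,25} is evaluated as the tuple (0, 25).
wrong = rf'(?:\w+\W+){0,25}?'
# Doubling the braces keeps them as a literal regex quantifier.
gap = rf'(?:\w+\W+){{0,25}}?'

# Hypothetical variable names; alternatives listed without trailing spaces.
group1 = 'Androids'
group2 = 'sells|purchased|resold'

# Each variable is wrapped in its own group so "|" stays local to it;
# only the group2 word is captured, so findall returns just that word.
after_pat = rf'\b(?:{group1})\W+{gap}({group2})\b'
before_pat = rf'\b({group2})\W+{gap}(?:{group1})\b'

counter2 = Counter(re.findall(after_pat, text, re.DOTALL))
counter3 = Counter(re.findall(before_pat, text, re.DOTALL))

print(repr(wrong))  # '(?:\\w+\\W+)(0, 25)?'
print(counter2)     # Counter({'purchased': 1})
print(counter3)     # Counter({'sells': 1})
```

Note that re.findall returns non-overlapping matches, so in this sketch "resold" is not counted separately in counter2: the single match that starts at the first "Androids" already ends at "purchased", and no later "Androids" precedes "resold".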
Any help would be fantastic, as I feel I'm going a little crazy trying different variations to make this code work! Thanks!