
I am trying to evaluate whether two named groups of words occur within 25 words of each other. I am having two issues, which are probably closely related:

  1. I am following this approach to evaluate whether certain words are near each other: http://www.regular-expressions.info/near.html . The original counter appears to work, but I then want to break my code into two parts to double-check. When I do so, my "counter3" has a double-counting issue (i.e., it counts the word "purchased" when it should only count "sells"). This is almost exactly the same question and issue as "Counting presence of words within context (near other words)", except using Python rather than Perl.
text = "CompanyA sells Androids and Robots. Androids are then purchased and resold by Company"

counter1 = Counter(re.findall(r'\b((?:Androids\W+(?:\w+\W+){0,25}?\W+sells|purchased|resold)|sells|purchased|resold\W+(?:\w+\W+){0,25}?Androids)\b',text, re.DOTALL))

# To check that my code is working correctly, I then want to split counter1 into two parts. However, counter3 is giving me a double-counting issue:
counter2 = Counter(re.findall(r'\b((?:Androids\W+(?:\w+\W+){0,25}?/W+sells|purchased|resold))\b',text, re.DOTALL))
counter3 = Counter(re.findall(r'\b((?:sells|purchased|resold\W+(?:\w+\W+){0,25}?/W+Androids))\b',text, re.DOTALL))

#Result: counter1= Counter({'sells': 1, 'purchased': 1, 'resold': 1})
#Result: counter2 = Counter({'purchased': 1, 'resold': 1})
#Result: counter3= Counter({'sells': 1, 'purchased': 1})


#I have also tried the below variation, which corrects counter3, but then causes an issue with counter2
counter2 = Counter(re.findall(r'\b((Androids)\W+(?:\w+\W+){0,25}?(sells|purchased|resold))\b',text, re.DOTALL))
counter3 = Counter(re.findall(r'\b((sells|purchased|resold)\W+(?:\w+\W+){0,25}?(Androids))\b',text, re.DOTALL))

#result counter2 = Counter({('Androids and Robots. Androids are then purchased',           'Androids','purchased'): 1})
#result counter3 = Counter({('sells Androids', 'sells', 'Androids'): 1})

  2. Next I want to create variables for the groups of words and then reference them within my regular expression. I am following the reference "How to use a variable inside a regular expression?". However, I am still having issues (maybe once question 1 is answered, it will lead me to the answer for question 2).
Group1 ='Androids'
Group2 = 'sells |purchased |resold '

counter2 = Counter(re.findall(rf'\b(?:{Group1}\W+(?:\w+\W+){0,25}?{Group2})\b',text, re.DOTALL))
counter3 = Counter(re.findall(rf'\b(?:{Group2}\W+(?:\w+\W+){0,25}?{Group1})\b',text, re.DOTALL))


#Result - counter2 = Counter({'': 2})
#Result - counter3 = Counter({'': 2})

#interestingly, if I try an alternative variation (i.e., removing ?:), which fixed counter3 in my first question, it does not fix the issue when I try to reference the variables 

counter2 = Counter(re.findall(rf'\b({Group1}\W+(?:\w+\W+){0,25}?{Group2})\b',text, re.DOTALL))
counter3 = Counter(re.findall(rf'\b({Group2}\W+(?:\w+\W+){0,25}?{Group1})\b',text, re.DOTALL))

#Result - counter2 = Counter({('purchased ', ''): 1, ('resold ', ''): 1})
#Result counter3 = Counter({('sells ', ''): 1, ('purchased ', ''): 1})

Any help would be fantastic, as I feel I'm going a little crazy trying different variations to make this code work! Thanks!

Konrad Rudolph
Grant
  • You have both 'Andriods' and 'Androids' in `text`. Is that intentional? You have all of this code but you *never* really state in **English** what it is you are actually trying to count ("counts the words purchased" is a bit vague) and what you expect the output to be. If you are trying to match 'Andriods' (note the spelling) separated by one of ('sells', 'purchased', 'resold') in either order in the `text` string, then there is only one match, namely 'sells Andriods' so why do you have `Group1 = 'Androids'` (note the spelling) in part 2? And why would you expect to see the word 'purchased'? – Booboo Jan 15 '22 at 13:35
  • You see 'purchased' only because your regex is incorrect if indeed you are looking for one of `('sells', 'purchased', 'resold')`. Instead of **sells|purchased|resold**, you should have **(?:sells|purchased|resold)** – Booboo Jan 15 '22 at 13:38
  • Thanks for looking into this! When I change my regex to include (?:sells|purchased|resold), the counter ends up empty (Result = Counter()): counter3 = Counter(re.findall(r'\b((?:sells|purchased|resold)\W+(?:\w+\W+){0,25}(?:Androids))\b',text, re.DOTALL)) – Grant Jan 15 '22 at 22:35
  • Also, thanks for catching my spelling error. I've updated the code to use "Androids" throughout. The results did not change, which tells me I have more problems than I realized. For counter2, I am expecting the Counter to contain "purchased" and "resold", because the word Androids occurs before purchased and resold. For counter3, I am expecting the Counter to contain only "sells", because the word sells occurs before the word Androids. – Grant Jan 15 '22 at 22:41
  • I think you should be updating your question. If you are looking for either 'sells' or 'purchased' or 'resold' **say so** in English and don't make us guess this from your faulty regex. If not, still say what you are trying to match. But I do believe your regex does not fit the pattern of the link you reference. – Booboo Jan 15 '22 at 22:55

1 Answer


If you are looking for 'Androids' and one of 'sells', 'purchased' or 'resold' within 25 words of each other, in either order, then the following finds all the matches and counts every word in the matched spans. If you want something different, then you should say what you want in plain English. (This is based strictly on the link the OP provided, with simple but logical substitutions.)

import re
from collections import Counter

text = "CompanyA sells Androids and Robots. Androids are then purchased and resold by Company"
regex = r'\b(?:Androids\W+(?:\w+\W+){0,25}?(?:sells|purchased|resold)|(?:sells|purchased|resold)\W+(?:\w+\W+){0,25}?Androids)\b'
matches = re.findall(regex, text)
print(matches)
c = Counter()
for match in matches:
    c.update(match.split())
print(c)

Prints:

['sells Androids', 'Androids are then purchased']
Counter({'Androids': 2, 'sells': 1, 'are': 1, 'then': 1, 'purchased': 1})

The pattern provided by the link was designed for single-word matches. When you substitute a choice of words (an "or" situation), you must put parentheses around the group of words separated by |, because alternation has the lowest precedence and would otherwise split the entire pattern into separate alternatives. To avoid introducing an additional capture group, use non-capturing parentheses, i.e. (?: ... ).
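To make the precedence point concrete, here is a minimal sketch that uses a shortened pattern (no 25-word window) on the same text. The variable names `ungrouped` and `grouped` are only illustrative.

```python
import re

text = "CompanyA sells Androids and Robots. Androids are then purchased and resold by Company"

# Without parentheses, | splits the whole pattern into three alternatives:
#   r'\bsells'   or   r'purchased'   or   r'resold\W+Androids\b'
ungrouped = re.findall(r'\bsells|purchased|resold\W+Androids\b', text)

# With a non-capturing group, the choice is limited to the three verbs,
# and whichever verb matches must still be followed by "Androids".
grouped = re.findall(r'\b(?:sells|purchased|resold)\W+Androids\b', text)

print(ungrouped)  # ['sells', 'purchased']
print(grouped)    # ['sells Androids']
```

The ungrouped version reports 'purchased' even though no 'Androids' follows it, which is the same effect as the unexpected 'purchased' in the question's counter3.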

Now if you want to count things differently, use this as a starting point. But be aware that adding capturing groups changes what the findall method returns.
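As one possible variation, the sketch below builds the two directional patterns from variables and counts only the keyword. It rests on two assumptions about what is wanted. First, each pattern has exactly one capturing group, so findall returns plain strings instead of tuples. Second, the word lists are interpolated with an f-string, where the quantifier must be written as {{0,25}}, because a bare {0,25} is evaluated by Python as the tuple (0, 25). The names `group1`, `group2`, `after_pattern` and `before_pattern` are illustrative, not part of any library.

```python
import re
from collections import Counter

text = "CompanyA sells Androids and Robots. Androids are then purchased and resold by Company"

group1 = 'Androids'
group2 = 'sells|purchased|resold'  # no trailing spaces inside the alternatives

# Doubled braces produce a literal {0,25} in the final pattern.
# The single pair of capturing parentheses wraps only the keyword.
after_pattern = rf'\b{group1}\W+(?:\w+\W+){{0,25}}?({group2})\b'
before_pattern = rf'\b({group2})\W+(?:\w+\W+){{0,25}}?{group1}\b'

# Keywords found after 'Androids', and keywords found before 'Androids'.
after_counts = Counter(re.findall(after_pattern, text))
before_counts = Counter(re.findall(before_pattern, text))

print(after_counts)   # Counter({'purchased': 1})
print(before_counts)  # Counter({'sells': 1})
```

Note that 'resold' is not counted in `after_counts`: each match starts at an occurrence of 'Androids' and, because the quantifier is lazy, stops at the nearest keyword, and findall only returns non-overlapping matches. Counting every keyword that has 'Androids' somewhere in the preceding 25 words would need a different approach than this pattern.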

Booboo