
I am trying to evaluate whether words from two named groups occur within 25 words of each other. Below, I have included a simple way to look for specific words within 25 words of each other, similar to this post (Regular Expression: Find words within 10 words of each other). However, I would like the regular expression to reference groups of words, each containing about 500 word variations. If possible, I do not want to list each word in the pattern itself.

#What I think I can do:
import re
from collections import Counter

text = "CompanyA sells Androids and Robots. Androids are purchased and then resold by CompanyB."

counter = Counter(re.findall(r'\b(sells|purchased|resold(?:\W+\w+){0,25}?\W+Androids | Androids(?:\W+\w+){0,25}?\W+sells|purchased|resold)\b', text, re.DOTALL))
sumcounter = sum(counter.values())

#Result = 3 
#What I want to do: 
Group1 = r'\b(Androids)\b'
Group2 = r'\b(sells|purchases|resold)\b'
#I am not sure I am formatting the groups correctly, but I've tried a few different ways to no avail (e.g., excluding the \b word boundary, or creating a list: Group2 = ['sells', 'purchases', 'resold'])

counter2 = Counter(re.findall(r'(?P<Group2>)(?:\W+\w+){0,25}?\W+(?P<Group1>) | (?P<Group1>)(?:\W+\w+){0,25}?\W+(?P<Group2>))', text, re.DOTALL))
sumcounter2 = sum(counter2.values())
#error: redefinition of group name 'Group1' as group 3; was group 2 at position 51

#However, this alternative also does not work
counter2 = Counter(re.findall(r'((?P<Group1>)\(?:\W+\w+){0,25}?\W+(?P<Group2>)', text, re.DOTALL))
sumcounter2 = sum(counter2.values())
#This code appears to count letters rather than recognize distinct words
#Result = 13

#Other Failed attempts
#counter2 = Counter(re.findall(r'\b(?P<Group2>)(?:\W+\w+){0,25}?\W+(?P<Group1>) | (?P<Group1>)(?:\W+\w+){0,25}?\W+(?P<Group2>)\b', text, re.DOTALL))
#counter2 = Counter(re.findall(r'\b(?P='Group2')(?:\W+\w+){0,25}?\W+(?P='Group1') | (?P='Group1')(?:\W+\w+){0,25}?\W+(?P='Group2')\b', text, re.DOTALL))
#counter2 = Counter(re.findall(r'\b(?P="Group2")(?:\W+\w+){0,25}?\W+(?P="Group1") | (?P="Group1")(?:\W+\w+){0,25}?\W+(?P="Group2")\b', text, re.DOTALL))

Edit: I've also tried the below in response to the following post: Named regular expression group "(?P<group_name>regexp)": what does "P" stand for?

This seems to be getting me closer to my answer, but my main code still receives an error message.

#The below appears to work:
counter2 = Counter(re.findall(r'((?P<Group1>.*)(?:\W+\w+){0,25}?\W+(?P<Group2>.*)', text, re.DOTALL)) 
sumcounter2 = sum(counter2.values()) 
#Result = 1 

#However, I'm still receiving an error here: 
counter = Counter(re.findall(r'((?P<Group1>.*)(?:\W+\w+){0,25}?\W+(?P<Group2>.*) | (?P<Group2>.*)(?:\W+\w+){0,25}?\W+(?P<Group1>.*))\b', text, re.DOTALL)) 
sumcounter = sum(counter.values())

#error: redefinition of group name 'Group2' as group 4; was group 3 at position 56
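
For reference, here is a minimal sketch of one possible way to get the intended behaviour. It rests on two facts about Python's `re` module: `(?P<Group1>)` defines a new, empty named group rather than reading the Python variable `Group1`, and a group name may appear only once per pattern, which is what the "redefinition of group name" error reports. The sketch instead builds each alternation from a plain Python list and interpolates it into the pattern as a non-capturing group. The names `group1_words`, `group2_words`, `alternation`, and `gap` are hypothetical, and the short lists stand in for the real ones of about 500 words each.

```python
import re
from collections import Counter

text = ("CompanyA sells Androids and Robots. "
        "Androids are purchased and then resold by CompanyB.")

# Hypothetical word lists; the real ones could be loaded from a file.
group1_words = ["Androids"]
group2_words = ["sells", "purchased", "resold"]


def alternation(words):
    """Join words into a regex alternation, escaping special characters.

    Longer words come first so that a longer variant is tried before
    any shorter word that is a prefix of it.
    """
    ordered = sorted(words, key=len, reverse=True)
    return "|".join(re.escape(w) for w in ordered)


g1 = alternation(group1_words)
g2 = alternation(group2_words)

# Up to 25 intervening words, matched lazily. This is a plain raw string,
# not an f-string, so the braces in {0,25} are left untouched.
gap = r"(?:\W+\w+){0,25}?\W+"

# Either order: a Group1 word then a Group2 word, or the reverse.
pattern = re.compile(rf"\b(?:(?:{g1}){gap}(?:{g2})|(?:{g2}){gap}(?:{g1}))\b")

matches = [m.group(0) for m in pattern.finditer(text)]
counter = Counter(matches)
total = sum(counter.values())

print(matches)  # ['sells Androids', 'Androids are purchased']
print(total)    # 2
```

Note that matches found this way do not overlap, so "resold" is not counted here even though it is within 25 words of the second "Androids". Counting every qualifying pair would need a different approach, such as tokenizing the text and comparing word positions.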
Grant