SAMPLE CODE

import re
line = "should we use regex more often, University of Pennsylvania. let me know at  321dsasdsa@dasdsa.com.lol"
match = re.search(r'/([A-Z][^\s,.]+[.]?\s[(]?)*(Hospital|University|Institute|Law School|School of|Academy)[^,\d]*(?=,|\d)/', line)
print(match.group(0))

I'm trying to extract university, school, or organization names from a given string using a regular expression in Python, but it raises an error.

ERROR MESSAGE

Traceback (most recent call last):
  File "C:/Python/addOrganization.py", line 4, in <module>
    print(match.group(0))
AttributeError: 'NoneType' object has no attribute 'group'

vinay nischal

2 Answers

Instead of re.search, try re.sub to print your expected output:

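import re
i = "should we use regex more often, University of Pennsylvania. let me know at  321dsasdsa@dasdsa.com.lol"
# Replace the whole string with the captured name (group 1).
# Note: re.sub returns the input unchanged if the pattern does not match.
line = re.sub(r"[\w\W]* ((Hospital|University|Centre|Law School|School|Academy|Department)[\w -]*)[\w\W]*$", r"\1", i)
print(line)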
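
If you would rather keep re.search, here is a minimal sketch. In Python the leading and trailing '/' in the question's pattern are literal characters, not delimiters, so re.search finds nothing and returns None. The sketch drops them and checks for None before calling group(). The function name find_org is hypothetical, and the tail of the pattern is changed (an assumption on my part) so that a full stop also ends the name.

```python
import re

# Adapted from the question's pattern: no '/' delimiters, and the name
# ends at a comma, full stop, digit, or end of string (assumed tweak).
ORG_PATTERN = re.compile(
    r"(?:[A-Z][^\s,.]+[.]?\s[(]?)*"
    r"(?:Hospital|University|Institute|Law School|School of|Academy)"
    r"[^,.\d]*(?=[,.\d]|$)"
)

def find_org(text):
    """Return the first organization-like name in text, or None."""
    match = ORG_PATTERN.search(text)
    # re.search returns None when nothing matches, so guard before group()
    return match.group(0).strip() if match else None

line = "should we use regex more often, University of Pennsylvania. let me know at  321dsasdsa@dasdsa.com.lol"
print(find_org(line))  # University of Pennsylvania
```

The None check is what avoids the AttributeError; the pattern itself is only one possible choice and will miss names that do not contain one of the listed keywords.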
pavithran G

The test string you've given is a made-up one: the university name is immediately followed by a full stop ('.'), whereas in the other examples in your pastebin sample it is followed by a comma.

line = "should we use regex more often, University of Pennsylvania. let me know at 321dsasdsa@dasdsa.com.lol"

I have managed to extract the names from the examples in your pastebin using a simple regex. You can see the details here: regex101.com

Logic

Since the fields are separated by commas, and the institute name is either the first field (as in the first case, which starts with the university name) or the second, the name you want will lie in either group 1 or group 2 of the match.

You can then check group 1 and group 2 against a pre-defined list of keywords and return whichever one matches.

Code

I have used two examples to show it works.

import re

line1 = 'The George Washington University, Washington, DC, USA.'
line2 = 'Department of Pathology, University of Oklahoma Health Sciences Center, Oklahoma City, USA. adekunle-adesina@ouhsc.edu'

matchlist = ['Hospital', 'University', 'Institute', 'School', 'Academy']  # all keywords that you need to look up
p = re.compile(r'^(.*?),\s+(.*?),(.*?)\.')  # groups 1 and 2 capture the first two comma-separated fields

# Take group 1 if it contains any keyword from matchlist; otherwise fall back to group 2
line1match = [m.group(1) if any(x in m.group(1) for x in matchlist) else m.group(2) for m in p.finditer(line1)]
line2match = [m.group(1) if any(x in m.group(1) for x in matchlist) else m.group(2) for m in p.finditer(line2)]

print(line1match)
[Out]: ['The George Washington University']

print(line2match)
[Out]: ['University of Oklahoma Health Sciences Center']
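
The same logic can be wrapped in a small function that also handles lines that do not fit the pattern. This is only a sketch: the name extract_institution is hypothetical, and returning None when neither group contains a keyword is my assumption about what you would want in that case.

```python
import re

KEYWORDS = ['Hospital', 'University', 'Institute', 'School', 'Academy']
PATTERN = re.compile(r'^(.*?),\s+(.*?),(.*?)\.')

def extract_institution(line):
    """Return the first of group 1 / group 2 that contains a keyword, else None."""
    m = PATTERN.search(line)
    if m is None:
        # The line does not have the "field, field, ... ." shape at all
        return None
    for candidate in (m.group(1), m.group(2)):
        if any(keyword in candidate for keyword in KEYWORDS):
            return candidate.strip()
    return None

print(extract_institution('The George Washington University, Washington, DC, USA.'))
print(extract_institution('should we use regex more often, University of Pennsylvania. let me know at  321dsasdsa@dasdsa.com.lol'))
```

The second call returns None because that made-up string contains only one comma, so the pattern never matches; checking for None avoids the AttributeError from the question.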
ParvBanks
  • The test string you've given is a made-up one: "should we use regex more often, University of Pennsylvania. let me know at 321dsasdsa@dasdsa.com.lol". In all the other strings the university name is never followed directly by a full stop, so this one doesn't match the regex built for your other pastebin examples. – ParvBanks Dec 06 '18 at 10:35