
I'm trying to parse a string containing a name and a degree. I have a long list of these. Some contain no degrees, some contain one, and some contain multiple.

Example strings:

Sam da Man J.D.
Green Eggs Jr. Ed.M.
Argle Bargle Sr. MA
Cersei Lannister M.A. Ph.D. 

As far as I can tell, the degrees come in the following patterns:

x.x.
x.x.x.
x.x.xx.
x.xx.
xx.x.
x.xxx.
two caps (ex: 'MA')

How would I parse this?

I'm new to regex, and breaking this problem down has proved very time-consuming. Following this post, I tried split = re.split('\s+|([.])',s) and split = re.split('\s+|\.',s), but both still split on the first space.

In response to the first comment, I have thought about the degree designations. I've been trying to write a regex that recognizes 'x.x' followed by a wildcard, because several of the degree patterns share that prefix, x.x(something): x.x., x.x.x., x.x.xx.

That would leave only a few more patterns to classify.

Alternatively, classifying the name might be easier?

Or even listing the degrees in a collection and searching for them?

{'M.A.T.', 'Ph.D.', 'MA', 'J.D.', 'Ed.M.', 'M.A.', 'M.B.A.', 'Ed.S.', 'M.Div.', 'M.Ed.', 'RN', 'B.S.Ed.'}
Community
goldisfine

3 Answers

0

Try replacing "Jr.", "Sr.", and so on with something like "Jr~", "Sr~", so that they no longer look like degrees. This is the regular expression for doing that:

/ (Jr|Sr)\. / $1~ /g


You obtain this string:

Sam da Man J.D.
Green Eggs Jr~ Ed.M.
Argle Bargle Sr~ MA
Cersei Lannister M.A. Ph.D. 

Now you can easily capture degrees with this regular expression:

/ (MA|RN|([A-Z][a-z]?[a-z]?\.)+) /g
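For anyone doing this in Python, the two steps might look roughly like the sketch below. It is an adaptation rather than a literal translation: the lookarounds `(?<!\S)` and `(?!\S)` are my substitution for the literal spaces around the pattern, so that a degree at the very end of a line still matches, and the names `SUFFIX`, `DEGREE`, and `find_degrees` are made up for the example.

```python
import re

lines = [
    "Sam da Man J.D.",
    "Green Eggs Jr. Ed.M.",
    "Argle Bargle Sr. MA",
    "Cersei Lannister M.A. Ph.D. ",
]

# Step 1: mask generational suffixes so "Jr." and "Sr." no longer look like degrees.
SUFFIX = re.compile(r"\b(Jr|Sr)\.")

# Step 2: a degree is MA, RN, or one or more dotted chunks such as "Ph." + "D.".
# The lookarounds require whitespace (or the string edge) on both sides.
DEGREE = re.compile(r"(?<!\S)(?:MA|RN|(?:[A-Z][a-z]{0,2}\.)+)(?!\S)")

def find_degrees(line):
    masked = SUFFIX.sub(r"\1~", line)  # "Jr." -> "Jr~"
    return DEGREE.findall(masked)

results = [find_degrees(line) for line in lines]
print(results)
# [['J.D.'], ['Ed.M.'], ['MA'], ['M.A.', 'Ph.D.']]
```

Without the masking step, "Jr." and "Sr." would themselves match the dotted-chunk part of the pattern.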


Yossarian
fazen
0

You can use this:

'[ ](MA|RN|([A-Z][a-z]?[a-z]?\.){2,3})'

Because it requires two or three dotted groups, it doesn't match a word with only one dot, such as Jr. or Sr.
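As a rough sketch of how this pattern could separate the name from the degrees in Python: the version below uses non-capturing groups so that each match is the whole degree, and swaps the leading `[ ]` for lookarounds (my change, not part of the pattern above). The helper name `split_name_and_degrees` and the input "Jon Snow" are invented for illustration.

```python
import re

# MA, RN, or two to three dotted chunks; a single chunk such as "Jr." is rejected.
# (?<!\S) and (?!\S) require whitespace or the string edge around the degree.
DEGREE = re.compile(r"(?<!\S)(?:MA|RN|(?:[A-Z][a-z]{0,2}\.){2,3})(?!\S)")

def split_name_and_degrees(line):
    """Return (name, degrees); everything before the first degree is the name."""
    matches = list(DEGREE.finditer(line))
    if not matches:
        return line.strip(), []
    name = line[:matches[0].start()].strip()
    return name, [m.group() for m in matches]

print(split_name_and_degrees("Green Eggs Jr. Ed.M."))
# ('Green Eggs Jr.', ['Ed.M.'])
print(split_name_and_degrees("Cersei Lannister M.A. Ph.D. "))
# ('Cersei Lannister', ['M.A.', 'Ph.D.'])
print(split_name_and_degrees("Jon Snow"))
# ('Jon Snow', [])
```

This assumes the degrees always come after the name; a name that begins with dotted initials would be cut short.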

MIE
0

I think the best approach is to create either a list or a regex of the specific degrees you're looking for, instead of trying to define patterns like x.x. that will match several different degrees. A pattern like that is too general and may match many other values in free text (in this case, people's initials).

import re

s = """Sam da Man J.D.
Green Eggs Jr. Ed.M.
Argle Bargle Sr. MA
Cersei Lannister M.A. Ph.D.
Albus Dumbledore M.A.T.
"""

# Escape each dot so it matches a literal "." rather than any character.
pattern = r"M\.A\.T\.|Ph\.D\.|MA|J\.D\.|Ed\.M\.|M\.A\.|M\.B\.A\.|Ed\.S\.|M\.Div\.|M\.Ed\.|RN|B\.S\.Ed\."
degrees = re.findall(pattern, s, re.MULTILINE)

print(degrees)

Output:

['J.D.', 'Ed.M.', 'MA', 'M.A.', 'Ph.D.', 'M.A.T.']
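If the list of degrees is likely to grow, one option is to keep it as a plain Python list and build the regex from it. The sketch below is only a suggestion: `degree_re` is a name I made up, the lookarounds are an extra safeguard of mine, and "MARY MAXWELL RN" is an invented input showing why they help.

```python
import re

# Degree list taken from the question; extend it to suit the real data.
DEGREES = ["M.A.T.", "Ph.D.", "MA", "J.D.", "Ed.M.", "M.A.", "M.B.A.",
           "Ed.S.", "M.Div.", "M.Ed.", "RN", "B.S.Ed."]

# Longest first, so "M.A.T." is tried before "M.A."; re.escape makes each "." literal.
alternatives = "|".join(re.escape(d) for d in sorted(DEGREES, key=len, reverse=True))

# The lookarounds require whitespace (or the string edge) on both sides, so "MA"
# does not match inside an all-caps word such as "MARY".
degree_re = re.compile(rf"(?<!\S)(?:{alternatives})(?!\S)")

print(degree_re.findall("Albus Dumbledore M.A.T."))  # ['M.A.T.']
print(degree_re.findall("MARY MAXWELL RN"))          # ['RN']
```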

If you're looking to get the names that appear between the degrees in a block of text like the one above, you can use re.split.

names = re.split(pattern, s)
names = [n.strip() for n in names if n.strip()]

print(names)

Output:

['Sam da Man', 'Green Eggs Jr.', 'Argle Bargle Sr.', 'Cersei Lannister', 'Albus Dumbledore']

Note that I had to strip the remaining strings and remove empty strings from the results to capture just the names. Doing that operation on the result allows the regex to be much simpler.

Note also that this can still fail when a specific degree could also be someone's initials (e.g., J.D. Salinger). You may need to make adjustments or other allowances based on your real data.
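One possible allowance for the initials problem, assuming each string holds a single person and the degrees always come last, is to skip regex entirely and peel known degrees off the end of the string. This is a sketch under those assumptions; `split_trailing_degrees` is a hypothetical helper, and "J.D. Salinger M.A." is a made-up input.

```python
# Degree set taken from the question.
DEGREES = {"M.A.T.", "Ph.D.", "MA", "J.D.", "Ed.M.", "M.A.", "M.B.A.",
           "Ed.S.", "M.Div.", "M.Ed.", "RN", "B.S.Ed."}

def split_trailing_degrees(line):
    """Peel known degrees off the end of the line; whatever is left is the name."""
    tokens = line.split()
    degrees = []
    while tokens and tokens[-1] in DEGREES:
        degrees.append(tokens.pop())
    degrees.reverse()  # restore left-to-right order
    return " ".join(tokens), degrees

print(split_trailing_degrees("J.D. Salinger M.A."))
# ('J.D. Salinger', ['M.A.'])
print(split_trailing_degrees("Green Eggs Jr. Ed.M."))
# ('Green Eggs Jr.', ['Ed.M.'])
```

Because only trailing tokens are checked, the leading "J.D." stays part of the name. A string that consists of nothing but degree-like tokens would come back with an empty name, so that case may need its own handling.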

Bill the Lizard