Parsing name and degree?

Question

I'm trying to parse a string containing a name and a degree. I have a long list of these. Some contain no degrees, some contain one, and some contain multiple.

Example strings:

Sam da Man J.D.
Green Eggs Jr. Ed.M.
Argle Bargle Sr. MA
Cersei Lannister M.A. Ph.D.

As far as I can tell, the degrees come in the following patterns:

x.x.
x.x.x.
x.x.xx.
x.xx.
xx.x.
x.xxx.
two caps (ex: 'MA')

How would I parse this?

I'm new to regex and breaking down this problem has proved very time-consuming. I've been using this post and tried split = re.split('\s+|([.])',s) and split = re.split('\s+|\.',s) but these still split on the first space.

I have thought, in response to the first comment, about the degree designations. I've been trying to make a regex that recognizes 'x.x' and then a wildcard afterwards because there are several patterns within the degrees which look like this: x.x(something): x.x. x.x.x. x.x.xx.

and then I'd have a few more to classify.

Alternatively, classifying the name might be easier?

Or even listing the degrees in a collection and searching for them?

{'M.A.T.', 'Ph.D.', 'MA', 'J.D.', 'Ed.M.', 'M.A.', 'M.B.A.', 'Ed.S.', 'M.Div.', 'M.Ed.', 'RN', 'B.S.Ed.'}

Perhaps you could make a regular expression that identifies degree suffixes? — GWW, Jul 02 '13 at 14:25
Don't forget DPhil - a doctorate from Oxford University, England — Bathsheba, Jul 02 '13 at 14:26
Here's Microsoft's take on it: http://support.microsoft.com/kb/168799 — lurker, Jul 02 '13 at 14:30
Lucky you. In your case I suggest you just hard-code the accreditations and OR them together in a RegEx. Solving this problem in full generality in your case is unnecessary. — Bathsheba, Jul 02 '13 at 14:37
That is indeed what I'm trying to do. And @mbratch: what language is that written in? — goldisfine, Jul 02 '13 at 14:48
MS example is (unfortunately) in Visual Basic. I cited it as an algorithm example. — lurker, Jul 02 '13 at 14:52
Got it. And yeah, they use a batch of degrees and titles rather than regexing it. — goldisfine, Jul 02 '13 at 14:59
But what is it that you're trying to achieve? You want to retrieve all the degrees? — cgledezma, Jul 05 '13 at 08:48

score 0 · Answer 1 · edited Jul 05 '13 at 12:15

Try replacing "Jr.", "Sr.", and similar suffixes with placeholders such as "Jr~", "Sr~", and so on. This is the regular expression for doing that:

/ (Jr|Sr)\. / $1~ /g


You obtain this string:

Sam da Man J.D.
Green Eggs Jr~ Ed.M.
Argle Bargle Sr~ MA
Cersei Lannister M.A. Ph.D.

Now you can easily capture degrees with this regular expression:

/ (MA|RN|([A-Z][a-z]?[a-z]?\.)+) /g

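
The two steps above could look something like this in Python. This is only a sketch: the names SUFFIX, DEGREE and find_degrees are made up for illustration, and I've swapped the literal spaces around the degree pattern for lookarounds (my own change, not part of the original expression) so that a degree at the very end of a line is still found.

```python
import re

lines = [
    "Sam da Man J.D.",
    "Green Eggs Jr. Ed.M.",
    "Argle Bargle Sr. MA",
    "Cersei Lannister M.A. Ph.D.",
]

# Step 1: mask generational suffixes so their dot is not read as a degree.
SUFFIX = re.compile(r"\b(Jr|Sr)\.")

# Step 2: match MA, RN, or one or more dotted groups such as "J.D." or "Ed.M.".
# The lookarounds require whitespace (or the string edge) on both sides.
DEGREE = re.compile(r"(?<!\S)(?:MA|RN|(?:[A-Z][a-z]{0,2}\.)+)(?!\S)")

def find_degrees(line):
    """Return the degree-like tokens found in one line."""
    masked = SUFFIX.sub(r"\1~", line)
    return DEGREE.findall(masked)

results = [find_degrees(line) for line in lines]
print(results)
```

Because a single dotted group is allowed, a lone initial such as "A." would also be picked up, so this only works if the names in your data contain no such tokens.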

score 0 · Answer 2 · answered Oct 02 '13 at 14:29

You can use this:

'[ ](MA|RN|([A-Z][a-z]?[a-z]?\.){2,3})'

It doesn't match any word with only one dot, so suffixes like "Jr." and "Sr." are skipped.
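
For what it's worth, here is roughly how that pattern could be used from Python. Treat it as a sketch: DEGREE and degrees_in are illustrative names, and I made the inner group non-capturing (an adjustment of mine) so that re.findall returns the whole degree instead of a tuple of groups.

```python
import re

# Same pattern as above, with "[ ]" written as a plain space and the inner
# repeated group made non-capturing so findall yields one string per match.
DEGREE = re.compile(r" (MA|RN|(?:[A-Z][a-z]?[a-z]?\.){2,3})")

def degrees_in(line):
    """Return every degree that follows a space in the given line."""
    return DEGREE.findall(line)

print(degrees_in("Green Eggs Jr. Ed.M."))
print(degrees_in("Argle Bargle Sr. MA"))
print(degrees_in("Cersei Lannister M.A. Ph.D."))
```

Since nothing stops "MA" or "RN" from matching the start of a longer all-caps word, you may want a trailing word boundary if your data contains such words.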

MIE


score 0 · Answer 3 · answered Jun 13 '22 at 21:25

I think the best approach is to create either a list or a regex of the specific degrees you're looking for, instead of trying to define patterns like x.x. that will match several different degrees. A pattern like that is too general and may match many other values in free text (in this case, people's initials).

import re

s = """Sam da Man J.D.
Green Eggs Jr. Ed.M.
Argle Bargle Sr. MA
Cersei Lannister M.A. Ph.D.
Albus Dumbledore M.A.T.
"""

# Escape each dot so it matches a literal period rather than any character.
pattern = r"M\.A\.T\.|Ph\.D\.|MA|J\.D\.|Ed\.M\.|M\.A\.|M\.B\.A\.|Ed\.S\.|M\.Div\.|M\.Ed\.|RN|B\.S\.Ed\."
degrees = re.findall(pattern, s)

print(degrees)

Output:

['J.D.', 'Ed.M.', 'MA', 'M.A.', 'Ph.D.', 'M.A.T.']

If you're looking to get the names that appear between the degrees in a block of text like the one above, you can use re.split.

names = re.split(pattern, s)
names = [n.strip() for n in names if n.strip()]

print(names)

Output:

['Sam da Man', 'Green Eggs Jr.', 'Argle Bargle Sr.', 'Cersei Lannister', 'Albus Dumbledore']

Note that I had to strip the remaining strings and remove empty strings from the results to capture just the names. Doing that operation on the result allows the regex to be much simpler.

Note also that this can still fail when a specific degree could also be someone's initials (e.g., J.D. Salinger). You may need to make adjustments or other allowances based on your real data.
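
One possible allowance for the initials problem, assuming each record sits on its own line and degrees only ever appear at the end of it, is to peel degrees off the end of the line and leave everything else as the name. The sketch below is a variation on the approach above, not a drop-in replacement: DEGREES, TRAILING and split_name_and_degrees are hypothetical names, and the degree list is just the one from the question.

```python
import re

# Hypothetical degree list taken from the question; extend it for real data.
DEGREES = ["M.A.T.", "Ph.D.", "MA", "J.D.", "Ed.M.", "M.A.", "M.B.A.",
           "Ed.S.", "M.Div.", "M.Ed.", "RN", "B.S.Ed."]

# Longest first so "M.A.T." is tried before "M.A."; re.escape makes the
# dots literal.
ALTERNATION = "|".join(
    re.escape(d) for d in sorted(DEGREES, key=len, reverse=True)
)

# One or more degrees, each preceded by whitespace, anchored at end of line.
TRAILING = re.compile(rf"(?:\s+(?:{ALTERNATION}))+$")

def split_name_and_degrees(line):
    """Split one line into (name, [degrees]); degrees may be empty."""
    line = line.strip()
    match = TRAILING.search(line)
    if match is None:
        return line, []
    return line[:match.start()], match.group().split()

print(split_name_and_degrees("Cersei Lannister M.A. Ph.D."))
print(split_name_and_degrees("Green Eggs Jr. Ed.M."))
print(split_name_and_degrees("J.D. Salinger"))
```

Because the match is anchored to the end of the line and requires whitespace before each degree, "J.D. Salinger" comes back as a name with no degrees. It would still misread a name that ends in initials matching a listed degree.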
