
I have a column in a dataframe like this:

npt2$name
#  [1] "Andreas Groll, M.D."
#  [2] ""
#  [3] "Pan-Chyr Yang, PHD"
#  [4] "Suh-Fang Jeng, Sc.D"
#  [5] "Mostafa K Mohamed Fontanet Arnaud"
#  [6] "Thomas Jozefiak, M.D."
#  [7] "Medical Monitor"
#  [8] "Qi Zhu, MD"
#  [9] "Holly Posner"
# [10] "Peter S Sebel, MB BS, PhD Chantal Kerssens, PhD"
# [11] "Lance A Mynderse, M.D."
# [12] "Lawrence Currie, MD"

I tried gsub but with no luck. After doing toupper(x) I need to replace all instances of 'MD' or 'M.D.' or 'PHD' with nothing.

Is there a nice short trick to do it?

In fact I would be interested to see it done on a single string and how differently it is done in one command on the whole list.

Scarabee
userJT
  • I was hoping to avoid Regular Expressions since I can simply enumerate all bad strings to be removed. Oh my.... yet another technology (REgEx) to go back to (re-master) :-( – userJT Feb 23 '12 at 21:14
  • The field should be only last name, but the data is not consistent. Goal is to end up with only data which is either a last name or first name and remove all academic or other titles – userJT Feb 23 '12 at 21:16
  • No need to remaster it - DWin, Justin and Tommy have given all you need to know! Just copy and paste. Though regex is one of the more useful things I've learned over the years... – Matt Parker Feb 23 '12 at 21:30
  • Well, but if I use some code, I need to be sure I understand it and that I know what I am doing. – userJT Feb 23 '12 at 21:57
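For the wish to simply enumerate the unwanted strings without regular expressions, one possible sketch is to loop over a list of titles and call gsub with fixed = TRUE, which treats each pattern as literal text. The vector name bad_titles and the sample input are made up for illustration, and the leftover ", " is not cleaned up here.

```r
# Sample input; the real data would be npt2$name.
x <- c("Andreas Groll, M.D.", "Qi Zhu, MD", "Pan-Chyr Yang, PHD", "Holly Posner")

# Hypothetical list of titles to strip, written in upper case
# because the names are upper-cased first (as in the question).
bad_titles <- c("M.D.", "MD", "PHD")

cleaned <- toupper(x)
for (title in bad_titles) {
  # fixed = TRUE: match the title literally, so "." needs no escaping
  cleaned <- gsub(title, "", cleaned, fixed = TRUE)
}
cleaned
```

Note that a literal match on "MD" would also hit those two letters inside an upper-cased name, so the list needs the same care as a regex would.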

3 Answers


Either of these:

gsub("MD|M\\.D\\.|PHD", "", test)  # target specific strings
gsub("\\,.+$", "", test)        # target all characters after comma

Both Matt Parker above and Tommy below have raised the question whether 'M.R.C.P.', 'PhD', 'D.Phil.' and 'Ph.D.' or other British or Continental designations of doctorate level degrees should be sought out and removed. Perhaps @user56 can advise what the intent was.
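As a rough illustration of the single-string versus whole-column question, the same gsub call works unchanged on one string and on a character vector, because gsub is vectorized. The variable names one and many are made up for this sketch. It also shows how the two patterns differ: the first leaves the ", " behind, while the second drops everything from the first comma onward.

```r
one  <- "Qi Zhu, MD"
many <- c("Andreas Groll, M.D.", "Pan-Chyr Yang, PHD", "Holly Posner")

# Same call, single string or vector: gsub is applied to each element.
one_titles  <- gsub("MD|M\\.D\\.|PHD", "", one)
many_titles <- gsub("MD|M\\.D\\.|PHD", "", many)

# Second pattern: remove the first comma and everything after it.
one_comma  <- gsub("\\,.+$", "", one)
many_comma <- gsub("\\,.+$", "", many)

many_titles  # "Andreas Groll, " "Pan-Chyr Yang, " "Holly Posner"
many_comma   # "Andreas Groll"   "Pan-Chyr Yang"   "Holly Posner"
```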

IRTFM
  • @MattParker `sub` does just match the first instance, but it's still vectorized. So it'll match the first instance in each element of the vector. – Justin Feb 23 '12 at 15:59
  • @Justin Got it, thanks! I was thinking first-in-vector instead of first-in-string. – Matt Parker Feb 23 '12 at 16:01
  • gsub might be better if there were ",MD PHD"s in there. Couldn't tell if all the other acronyms were to be deleted. – IRTFM Feb 23 '12 at 16:01
  • Element 10 in the OP has two names with PhD in them, so `gsub` is required. – Tommy Feb 23 '12 at 17:44
  • I don't really disagree, although I thought that was a data entry error and "fixed" it in my test scenario. – IRTFM Feb 23 '12 at 19:03
  • If sub is being used, then gsub must be used instead. It must remove all instances of 'MD', not just the first instance (which is all sub does). – userJT Feb 23 '12 at 21:17
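The sub versus gsub point from these comments can be sketched as follows, using element 10 from the question. The variable names are made up for illustration. sub replaces only the first match in each string, gsub replaces every match, and both work element by element on a vector.

```r
x <- "Peter S Sebel, MB BS, PhD Chantal Kerssens, PhD"

first_only  <- sub("PhD", "", x)   # only the first "PhD" is removed
all_removed <- gsub("PhD", "", x)  # both are removed

# sub is still vectorized: first match *per element*, not per vector.
per_element <- sub("PhD", "", c("A, PhD", "B, PhD"))
per_element  # "A, " "B, "
```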

With a single ugly regex:

 gsub('[M,P].?D.?','',npt2$name)

Which says: find an M or a P (the comma inside the brackets is matched literally too), followed by zero or one character of any kind, followed by a D and zero or one additional character. More explicitly, you could do this in three steps:

npt2$name <- gsub('MD','',npt2$name)
npt2$name <- gsub('M\\.D\\.','',npt2$name)
npt2$name <- gsub('PhD','',npt2$name)

In those three, what's happening should be more straightforward. In the second replacement you need to "escape" each period, since an unescaped period is a special character that matches any character.
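To see why the escaping matters, here is a small sketch comparing an unescaped and an escaped version of the combined pattern. It uses [MP] rather than [M,P], on the assumption that the comma in the character class was not intended. The sample names and the variable names loose and strict are made up for illustration.

```r
x <- c("Brian McDonald, MD", "Andreas Groll, M.D.")

# Unescaped: ".?" matches ANY optional character, so "McDo" is eaten too.
loose <- gsub("[MP].?D.?", "", x)
loose   # "Brian nald, "      "Andreas Groll, "

# Escaped: "\\.?" matches only an optional literal period.
strict <- gsub("[MP]\\.?D\\.?", "", x)
strict  # "Brian McDonald, "  "Andreas Groll, "
```

Both versions leave the trailing ", " in place. Removing it needs an extra optional comma and spaces in the pattern.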

Justin
  • I like the combined regex, but I think you'd need to specify an optional literal period instead of an optional any-char between the letters - consider "Brian McDonald", for example. – Matt Parker Feb 23 '12 at 15:57
  • Touché! But then you miss MD. If I were doing this data munging, I would do it explicitly with one replacement per line for clarity and repeatability (or use DWin's version with logical ORs). – Justin Feb 23 '12 at 16:00
  • Would it miss MD? `gsub('[M,P]\\.?D\\.?', '', "Brian McDonald, MD")` achieves the desired effect, right? – Matt Parker Feb 23 '12 at 16:04
  • @MattParker sure does, `?` is one or zero characters. good point, early still and haven't had my coffee! – Justin Feb 23 '12 at 16:06
  • I like the 3 line one. Looks like there is a lot of golfing possible with RegEx code – userJT Feb 23 '12 at 21:58

Here's a variant that removes the extra ", " too. It does not require toupper either, but if you want case-insensitive matching, just pass ignore.case=TRUE to gsub.

test <- c("Andreas Groll, M.D.", 
  "",
  "Pan-Chyr Yang, PHD",
  "Suh-Fang Jeng, Sc.D",
  "Peter S Sebel, MB BS, PhD Chantal Kerssens, PhD",
  "Lawrence Currie, MD")

gsub(",? *(MD|M\\.D\\.|P[hH]D)", "", test)
#[1] "Andreas Groll"                         ""                                     
#[3] "Pan-Chyr Yang"                         "Suh-Fang Jeng, Sc.D"                  
#[5] "Peter S Sebel, MB BS Chantal Kerssens" "Lawrence Currie"
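A minimal sketch of the ignore.case=TRUE variant mentioned above. The lower-case entry "Qi Zhu, md" is made-up input added to show the effect, and the names test_ci and cleaned_ci are also just for illustration.

```r
test_ci <- c("Andreas Groll, M.D.",
  "Pan-Chyr Yang, PHD",
  "Qi Zhu, md",            # hypothetical lower-case title
  "Suh-Fang Jeng, Sc.D")

# With ignore.case = TRUE, one spelling per title is enough.
cleaned_ci <- gsub(",? *(MD|M\\.D\\.|PHD)", "", test_ci, ignore.case = TRUE)
cleaned_ci
#[1] "Andreas Groll"       "Pan-Chyr Yang"       "Qi Zhu"
#[4] "Suh-Fang Jeng, Sc.D"
```

One caveat: case-insensitive matching would also remove the letters "md" if they occur inside a name, so it is worth checking the results on the real data.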
Tommy