
I've got a list of owner names in all caps that I'd like to convert to proper capitalization:

                   owner1
 1:    DXXXXX JOSEPH V JR
 2:          MIRNA NXXXXX
 3:          ADRIAN TXXXX
 4: CUTLER PXXXXXXXXX LLC
 5:    GVM PXXXXXXXXX LLC
 6:      EARLENA RXXXXXXX
 7:      NATHANIEL TXXXXX
 8:         DXXXXXX DONNA
 9:     LXXXX ELAINE E TR
10:      SXXXXXX KIMBERLY

(for reproduction purposes:

 owner1<-c("DXXXXX JOSEPH V JR","MIRNA NXXXXX","ADRIAN TXXXX",
           "CUTLER PXXXXXXXXX LLC","GVM PXXXXXXXXX LLC",
           "EARLENA RXXXXXXX","NATHANIEL TXXXXX","DXXXXXX DONNA",
           "LXXXX ELAINE E TR","SXXXXXX KIMBERLY")

)

Desired output:

                   owner1
 1:   Dxxxxx Joseph V. Jr
 2:          Mirna Nxxxxx
 3:          Adrian Txxxx
 4: Cutler Pxxxxxxxxx LLC
 5:    GVM Pxxxxxxxxx LLC
 6:      Earlena Rxxxxxxx
 7:      Nathaniel Txxxxx
 8:         Dxxxxxx Donna
 9:    Lxxxx Elaine E. TR
10:      Sxxxxxx Kimberly

A big first step is a version of the .simpleCap function mentioned in ?chartr:

.simpleCap <- function(x) {
    s <- strsplit(tolower(x), " ")[[1]]
    paste(toupper(substring(s, 1, 1)), substring(s, 2),
          sep = "", collapse = " ")
}

This covers a large chunk of the problem, but fails on observations 4, 5 and 9, where it turns LLC, GVM and TR into Llc, Gvm and Tr. I can supplement it to treat key phrases (LLC, TR, etc.) separately, but that still leaves unanticipated abbreviations like the GVM in observation 5.
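As a quick illustration of that failure mode, here is a minimal sketch that applies the question's `.simpleCap` to three of the problem rows. The variable names `problem_rows` and `simple_out` are only for illustration.

```r
# The question's version of .simpleCap (handles one string at a time)
.simpleCap <- function(x) {
  s <- strsplit(tolower(x), " ")[[1]]
  paste(toupper(substring(s, 1, 1)), substring(s, 2),
        sep = "", collapse = " ")
}

# Rows 4, 5 and 9 from the sample data
problem_rows <- c("CUTLER PXXXXXXXXX LLC", "GVM PXXXXXXXXX LLC",
                  "LXXXX ELAINE E TR")

# vapply() is needed because .simpleCap only uses the first element of x
simple_out <- vapply(problem_rows, .simpleCap, character(1), USE.NAMES = FALSE)
print(simple_out)
# "Cutler Pxxxxxxxxx Llc" "Gvm Pxxxxxxxxx Llc" "Lxxxx Elaine E Tr"
```

Every token gets the same treatment, so abbreviations are title-cased along with the names.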

Here's the function I've got so far (sped up wonderfully by @eipi10's solution below, which vectorized the .simpleCap function, allowing the whole function to be applied to vectors):

to.proper <- function(strings) {
  # Vectorized version of .simpleCap;
  #   I've also built in that I know `strings` is all caps
  res <- gsub("\\b([A-Z])([A-Z]+)*", "\\U\\1\\L\\2", strings, perl = TRUE)
  # In my data, some Irish/Scottish names separated the MC prefix
  #   Also, re-capitalize following a hyphen
  res <- gsub("\\bMc\\s", "Mc", gsub("(-.)", "\\U\\1", res, perl = TRUE))
  for (init in c("[A-Z]", "Inc", "Assoc", "Co",
                 "Jr", "Sr", "Tr", "Bros")) {
    # Add a period after initials and common abbreviations
    res <- gsub(paste0("\\b(", init, ")\\b"), "\\1.", res)
  }
  for (abbr in c("[B-DF-HJ-NP-TV-XZ][b-df-hj-np-tv-xz]{2,}",
                 "Pa", "Ii", "Iii", "Iv", "Lp", "Tj",
                 "Xiv", "Ll", "Yml", "Us")) {
    # Re-capitalize any word made up of >=3 consonants (excluding
    #   Y for such names as LYNN and WYNN), as well as
    #   some other common phrases that need upper-casing
    res <- gsub(paste0("\\b(", abbr, ")\\b"), "\\U\\1", res, perl = TRUE)
  }
  # Re-capitalize post-Mc letters, e.g. in Mcmahon
  gsub("\\bMc([a-z])", "Mc\\U\\1", res, perl = TRUE)
}
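To see what the consonant rule does on its own, here is a minimal sketch that isolates that one `gsub()` call. The helper name `recap_consonants` and the example names are hypothetical and not taken from the real data.

```r
# Hypothetical helper isolating the consonant rule from to.proper():
# upper-case any whole word made of 3 or more consonants (Y is not
# counted as a consonant, so names like Lynn and Wynn are left alone)
recap_consonants <- function(x) {
  gsub("\\b([B-DF-HJ-NP-TV-XZ][b-df-hj-np-tv-xz]{2,})\\b", "\\U\\1",
       x, perl = TRUE)
}

# Illustrative, made-up inputs that are already in title case
titled <- c("Gvm Properties Llc", "Lynn Schmidt", "Wynn Bros")
recapped <- recap_consonants(titled)
print(recapped)
# "GVM Properties LLC" "Lynn Schmidt" "Wynn Bros"
```

Because the pattern is anchored with `\\b` on both sides, a word is only upper-cased when it contains no vowels at all, so a name like Schmidt is untouched. One caveat: the anonymized placeholders in the sample data (e.g. Pxxxxxxxxx) consist entirely of consonants, so this rule would upper-case them too.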

Any ideas for robust-ish ways to leave potentially unpredicted abbreviations alone in this process (particularly, like those in observation 5 which are uncommon)?

MichaelChirico
    I think you may need some list of suffixes to leave `LLC, TR` out of the match and not to be used in the capitalization – akrun May 22 '15 at 21:39
    In addition to @akrun's suggestion, have you tried stri_trans_totitle() from the stringi package? – lawyeR May 22 '15 at 21:48
  • @lawyeR That should also give the same problem. I tried it :-) – akrun May 22 '15 at 21:48
  • @lawyeR is that in the development version of `stringi`? I'm not seeing it in the [documentation](http://cran.r-project.org/web/packages/stringi/stringi.pdf) – MichaelChirico May 22 '15 at 21:49
  • Yes, look in the pdf documentation on page 132 – lawyeR May 22 '15 at 21:53
  • See edit for current working function; the real trouble is the uncommon abbreviations like that in observation 5. Perhaps just check if there's three consonants in a row?? – MichaelChirico May 22 '15 at 21:55
  • @lawyeR could you link the documentation you're seeing? The link I posted is only 117 pages :o – MichaelChirico May 22 '15 at 21:57
  • Easily reproducible data would be nice. I guess in a regex question, perhaps the expectation is that folks know how to ingest that, though... – Frank May 22 '15 at 22:05

1 Answer


Here's a function that uses a regex to convert strings to title case (adapted from @BenBolker's answer to a question I asked on SO a while back).

The function is written so that you can pass an argument called exceptions that deals with special cases like GVM. I'm not sure if this is flexible enough for your needs, since you have to hard-code the exceptions, but I thought I'd post it and see if anyone can suggest improvements.

dat = data.frame(owner1 = c("DXXXXX JOSEPH V JR","MIRNA NXXXXX","ADRIAN TXXXX",
                            "CUTLER PXXXXXXXXX LLC","GVM PXXXXXXXXX LLC",
                            "EARLENA RXXXXXXX","NATHANIEL TXXXXX","DXXXXXX DONNA",
                            "LXXXX ELAINE E TR","SXXXXXX KIMBERLY"),
                 stringsAsFactors = FALSE)  # keep owner1 as character, not factor

# Convert a string to title case
tc = function(strings, exceptions="\\b(gvm)\\b") {

  # Convert to title case, excluding terminal LLC, TR, etc.
  # (the \\b after the suffix stops " TR" from matching the start of a longer word)
  title.case = gsub("\\b([a-zA-Z])([a-zA-Z]+)*((?: LLC| TR| FBO| LP)\\b)?",
                    "\\U\\1\\L\\2\\U\\3", strings, perl=TRUE)

  # Add a period after initials (presumed to be any lone capital letter)
  title.case = gsub(" ([A-Z]) ", " \\1\\. ", title.case)

  # Deal with exceptions
  title.case = gsub(exceptions, "\\U\\1", title.case, perl=TRUE, ignore.case=TRUE)

  return(title.case)
}

dat$title.case = tc(dat$owner1)

                  owner1            title.case
1     DXXXXX JOSEPH V JR   Dxxxxx Joseph V. Jr
2           MIRNA NXXXXX          Mirna Nxxxxx
3           ADRIAN TXXXX          Adrian Txxxx
4  CUTLER PXXXXXXXXX LLC Cutler Pxxxxxxxxx LLC
5     GVM PXXXXXXXXX LLC    GVM Pxxxxxxxxx LLC
6       EARLENA RXXXXXXX      Earlena Rxxxxxxx
7       NATHANIEL TXXXXX      Nathaniel Txxxxx
8          DXXXXXX DONNA         Dxxxxxx Donna
9      LXXXX ELAINE E TR    Lxxxx Elaine E. TR
10      SXXXXXX KIMBERLY      Sxxxxxx Kimberly
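
Since the exceptions have to be hard-coded, it may be easier to keep them in a character vector and build the pattern from that. The sketch below assumes a hypothetical helper `make_exceptions` (not part of the answer) and assumes the words contain no regex metacharacters.

```r
# Hypothetical helper: turn a vector of words into a pattern with the
# same shape as the default `exceptions` argument, "\\b(gvm)\\b"
make_exceptions <- function(words) {
  paste0("\\b(", paste(words, collapse = "|"), ")\\b")
}

# "abc" is a made-up second exception, for illustration only
exc_pattern <- make_exceptions(c("gvm", "abc"))
print(exc_pattern)

# Same call that the "Deal with exceptions" step makes
fixed <- gsub(exc_pattern, "\\U\\1", "Gvm Pxxxxxxxxx LLC",
              perl = TRUE, ignore.case = TRUE)
print(fixed)
# "GVM Pxxxxxxxxx LLC"
```

The resulting pattern could then be passed as `tc(dat$owner1, exceptions = exc_pattern)`.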
eipi10
  • Big props for a vectorized version of the `.simpleCap` function I was using; this sped up my code substantially. I eventually settled on a function close to the one you presented. Mine is more tailor-made; to generalize it, I would probably pass `exceptions` and `initialize` arguments as well. – MichaelChirico May 23 '15 at 19:42
  • I am also using the following to figure out what sort of 2-letter consonant phrases are sitting around, and going case-by-case on them: `regmatches(string,regexpr("\\b[B-DF-HJ-NP-TV-XZ]{2}\\b",string))` (unfortunately a blanket exception is inappropriate due to the abundance of abbreviations like Jr, Sr, Co, Sc (School), Ch (Church) and some Vietnamese names like Ng, etc.) – MichaelChirico May 24 '15 at 18:34
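
A small variation on the scan in the last comment: `regexpr()` only reports the first match in each string, so `gregexpr()` may be a better fit when a name can contain more than one 2-letter consonant token. This is a sketch on made-up inputs; `owners` and `hits` are illustrative names.

```r
# Made-up all-caps inputs, for illustration only
owners <- c("NG TXXXX", "ST MARY CH", "DXXXXX JOSEPH V JR")

# gregexpr() finds every match per string; regmatches() then returns a
# list with one character vector per input, flattened here by unlist()
hits <- unlist(regmatches(owners,
                          gregexpr("\\b[B-DF-HJ-NP-TV-XZ]{2}\\b", owners)))
print(hits)
# "NG" "ST" "CH" "JR"

# Tabulate to see which tokens are common enough to handle explicitly
print(sort(table(hits), decreasing = TRUE))
```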