I've got a list of owner names in all caps that I'd like to convert to proper capitalization:
owner1
1: DXXXXX JOSEPH V JR
2: MIRNA NXXXXX
3: ADRIAN TXXXX
4: CUTLER PXXXXXXXXX LLC
5: GVM PXXXXXXXXX LLC
6: EARLENA RXXXXXXX
7: NATHANIEL TXXXXX
8: DXXXXXX DONNA
9: LXXXX ELAINE E TR
10: SXXXXXX KIMBERLY
For reproduction purposes:
owner1 <- c("DXXXXX JOSEPH V JR", "MIRNA NXXXXX", "ADRIAN TXXXX",
            "CUTLER PXXXXXXXXX LLC", "GVM PXXXXXXXXX LLC",
            "EARLENA RXXXXXXX", "NATHANIEL TXXXXX", "DXXXXXX DONNA",
            "LXXXX ELAINE E TR", "SXXXXXX KIMBERLY")
Desired output:
owner1
1: Dxxxxx Joseph V. Jr
2: Mirna Nxxxxx
3: Adrian Txxxx
4: Cutler Pxxxxxxxxx LLC
5: GVM Pxxxxxxxxx LLC
6: Earlena Rxxxxxxx
7: Nathaniel Txxxxx
8: Dxxxxxx Donna
9: Lxxxx Elaine E. TR
10: Sxxxxxx Kimberly
A big first step is a version of the .simpleCap function mentioned in ?chartr:
.simpleCap <- function(x) {
  s <- strsplit(tolower(x), " ")[[1]]
  paste(toupper(substring(s, 1, 1)), substring(s, 2),
        sep = "", collapse = " ")
}
This solves a large chunk of the problem, but fails on observations 4, 5 and 9, where LLC, GVM and TR get lower-cased. I can supplement it to treat known key phrases (LLC, TR, etc.) separately, but that still leaves unanticipated abbreviations like the GVM in observation 5.
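To make the keep-list idea concrete, here is a minimal sketch. The names `simpleCapKeep` and `keep.upper` are made up for illustration, and the keep-list holds only the two phrases mentioned above. It title-cases each word unless the word appears in the keep-list, which fixes observations 4 and 9 but, as expected, still mangles GVM:

```r
# Observations 4, 5 and 9 from the sample data
owners <- c("CUTLER PXXXXXXXXX LLC", "GVM PXXXXXXXXX LLC", "LXXXX ELAINE E TR")

# Hypothetical keep-list: tokens that should stay upper case
keep.upper <- c("LLC", "TR")

simpleCapKeep <- function(x, keep = keep.upper) {
  s <- strsplit(x, " ", fixed = TRUE)[[1]]
  # Input is assumed to be all caps, so only the tail needs lower-casing
  cap <- paste0(substring(s, 1, 1), tolower(substring(s, 2)))
  # Leave keep-listed tokens untouched, title-case everything else
  paste(ifelse(s %in% keep, s, cap), collapse = " ")
}

res <- vapply(owners, simpleCapKeep, character(1), USE.NAMES = FALSE)
print(res)
# "Cutler Pxxxxxxxxx LLC" "Gvm Pxxxxxxxxx LLC" "Lxxxx Elaine E TR"
```

Any abbreviation not in the keep-list goes through the title-casing branch, which is exactly the gap this question is about.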
Here's the function I've got so far (sped up wonderfully by @eipi10's solution below, which vectorized the .simpleCap function, allowing the whole function to be applied to vectors):
to.proper <- function(strings) {
  # Vectorized version of .simpleCap;
  #  I've also built in that I know `strings` is all caps
  res <- gsub("\\b([A-Z])([A-Z]+)*", "\\U\\1\\L\\2", strings, perl = TRUE)
  # In my data, some Irish/Scottish names have the Mc prefix split off
  #  by a space; rejoin it. Also, re-capitalize following a hyphen
  res <- gsub("\\bMc\\s", "Mc", gsub("(-.)", "\\U\\1", res, perl = TRUE))
  for (init in c("[A-Z]", "Inc", "Assoc", "Co",
                 "Jr", "Sr", "Tr", "Bros")) {
    # Add a period after single-letter initials and common abbreviations
    res <- gsub(paste0("\\b(", init, ")\\b"), "\\1.", res)
  }
  for (abbr in c("[B-DF-HJ-NP-TV-XZ][b-df-hj-np-tv-xz]{2,}",
                 "Pa", "Ii", "Iii", "Iv", "Lp", "Tj",
                 "Xiv", "Ll", "Yml", "Us")) {
    # Re-capitalize any word made up of >=3 consonants (excluding
    #  Y for such names as LYNN and WYNN), as well as
    #  some other common phrases that need upper-casing
    res <- gsub(paste0("\\b(", abbr, ")\\b"), "\\U\\1", res, perl = TRUE)
  }
  # Re-capitalize post-Mc letters, e.g. in Mcmahon
  gsub("\\bMc([a-z])", "Mc\\U\\1", res, perl = TRUE)
}
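To show what the consonant-run rule does in isolation, here is a small sketch of just the first step and that rule. The two input names are hypothetical, made up for illustration, because the X placeholders in the sample data (e.g. Pxxxxxxxxx) would themselves match the consonant pattern and get upper-cased:

```r
# Hypothetical inputs, not taken from the data above
x <- c("GVM HOLDINGS LLC", "LYNN SMITH")

# Step 1: title-case every word (input assumed to be all caps)
step1 <- gsub("\\b([A-Z])([A-Z]+)*", "\\U\\1\\L\\2", x, perl = TRUE)
print(step1)
# "Gvm Holdings Llc" "Lynn Smith"

# Step 2: upper-case whole words made of >=3 consonants (Y excluded)
consonants <- "\\b([B-DF-HJ-NP-TV-XZ][b-df-hj-np-tv-xz]{2,})\\b"
step2 <- gsub(consonants, "\\U\\1", step1, perl = TRUE)
print(step2)
# "GVM Holdings LLC" "Lynn Smith"
```

So vowel-free abbreviations are recovered without being listed explicitly, while an abbreviation containing a vowel would still need an entry in the hard-coded list.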
Any ideas for robust-ish ways to leave unanticipated abbreviations alone in this process (particularly uncommon ones, like the one in observation 5)?