
I am working on a large file (over 2 million rows) from which I would like to remove all titles and suffixes (personal and/or professional) in each of the strings. As you will see from the small test case below, the titles and suffixes appear at different positions within each string.

I have used parts of answers from the following 3 questions:

Negative lookahead on Regex Pattern

regular expression for exact match of a word

How to search for multiple strings and replace them with nothing within a list of strings

test <- c("pan-chr ii", "true ii.", "mr. and mrs panjii", "pans iv prof",
"md trs iv.", "iipan", "a c iii miss clark", "a c iv jones mrs",
"a c jones iv", "a c jr huffman phd.", "a c jr markkula",
"a c sr. goldtrap", "mr & mrs prof dr. a c cjdr iv, esq.",
"false mr petty phd", "abe jr esquibel phd",
"md reginald r dr esquire garcia", "laurence curry, md",
"lawrence mcdonald md phd", "mdonald mr and mrs sebelmd dr jr md phd",
 "(van) der walls")

# test
# [1] "pan-chr ii"                                   
# [2] "true ii."                                     
# [3] "mr. and mrs panjii"                           
# [4] "pans iv prof"                                 
# [5] "md trs iv."                                   
# [6] "iipan"                                        
# [7] "a c iii miss clark"                           
# [8] "a c iv jones mrs"                             
# [9] "a c jones iv"                                 
# [10] "a c jr huffman phd."                          
# [11] "a c jr markkula"                              
# [12] "a c sr. goldtrap"                             
# [13] "mr & mrs prof dr. a c cjdr iv, esq."          
# [14] "false mr petty phd"                           
# [15] "abe jr esquibel phd"                          
# [16] "md reginald r dr esquire garcia"              
# [17] "laurence curry, md"                           
# [18] "lawrence mcdonald md phd"                     
# [19] "mdonald mr and mrs sebelmd dr jr md phd"
# [20] "(van) der walls"

# remove an optional comma and spaces followed by a title/suffix, provided the
# match is not immediately followed by a word character or digit
pattern <- paste0(
  ",? *(mister|sir|madam|mr\\.|mr|mrs\\.|mrs|ms\\.|",
  "mr\\. and mrs\\.|mr and mrs|mr\\. and mrs|mr and mrs\\.|",
  "mr\\. & mrs\\.|mr & mrs|mr\\. & mrs|mr & mrs\\.|& mrs\\.|and mrs\\.|",
  "& mrs|and mrs|ms|miss\\.|miss|prof\\.|prof|professor|",
  "doctor|md|md\\.|m\\.d\\.|dr\\.|dr|phd|phd\\.|esq\\.|esq|esquire|",
  "i{2,3}|i{2,3}\\.|iv|iv\\.|jr|jr\\.|sr|sr\\.|\\(|\\))(?![\\w\\d])"
)
testresult <- gsub(pattern, "", test, perl = TRUE)

# testresult
# [1] "pan-chr"                    "true."                     
# [3] " panj"                      "pans"                      
# [5] " trs."                      "iipan"                     
# [7] "a c clark"                  "a c jones"                 
# [9] "a c jones"                  "a c huffman."              
# [11] "a c markkula"               "a c. goldtrap"             
# [13] " a c cj"                    "false petty"               
# [15] "abe esquibel"               " reginald r garcia"
# [17] "laurence curry"             "lawrence mcdonald"         
# [19] "mdonald sebel"              "(van der walls"

1) How should the regular expression used to create testresult be revised to achieve the desired result shown below?

2) Is there a faster option than using gsub, since the file has more than 2 million rows?

Thank you.

# testresult that I want to have
# [1] "pan-chr"                       "true"                        
# [3] "panjii"                        "pans"                         
# [5] "trs"                           "iipan"                        
# [7] "a c clark"                     "a c jones"                    
# [9] "a c jones"                     "a c huffman"                 
# [11] "a c markkula"                 "a c goldtrap"                
# [13] "a c cjdr"                     "false petty"                  
# [15] "abe esquibel"                 "reginald r garcia"
# [17] "laurence curry"               "lawrence mcdonald"         
# [19] "mdonald sebelmd"              "van der walls"  
  • why do you want esquire in the last one? also why not just use a perl script rather than r if speed is a concern? – rawr Apr 16 '15 at 16:18
  • and in your update, why do you want sc.d and not phd (and others)? and the comma in that one – rawr Apr 16 '15 at 16:33
  • @rawr Thank you for your comments. I have updated my question. I need to have this code written in R and not in any other language. – iembry Apr 16 '15 at 16:46
  • do you want package solutions? see stringr or stringi packages – rawr Apr 16 '15 at 16:47
  • @rawr Any R solution will do (base, stringr, stringi, etc.) as long as it can return the proper results and work on > 2 million rows as quickly as possible. – iembry Apr 16 '15 at 16:50

1 Answer


I don't think creating one regex for all cases is the best approach here. I spent some time trying, and you'll have problems with spaces, since you have titles at the beginning, at the end, and in sequence within your strings. Ultimately, even if you remove all the titles, you'll probably run some names together (at least that happened to me) and leave multiple spaces behind, which would require further gsubs to clean up. You are also more prone to error, since I'm sure you won't be able to check all 2 million rows for specific cases and combinations.
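
For example, here is a minimal sketch of the kind of follow-up gsubs I mean, using made-up leftover strings rather than the actual testresult:

# hypothetical leftovers after a single-regex removal: stray periods,
# doubled spaces and leading/trailing spaces
leftover <- c(" a c  huffman.", "  panj ")
leftover <- gsub("\\.", "", leftover)      # drop stray periods
leftover <- gsub(" {2,}", " ", leftover)   # collapse runs of spaces
leftover <- gsub("^ +| +$", "", leftover)  # trim leading/trailing spaces
leftover
# [1] "a c huffman" "panj"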

I suggest a different approach. It is certainly slower than the perfect regex, if such a regex is even possible, but the results are more reliable: split each string on a set of delimiters, remove the pieces you're not interested in, and then bind the rest back together. Like this:

test.split <- strsplit(test, "\\s|\\.|\\,|\\(|\\)") # Split on spaces, periods, commas and parentheses

titles <- c("mr", "mrs", "iv", "md", "phd", "iii", "ii", "and", "&", "miss",
            "jr", "sr", "prof", "professor", "esquire", "dr", "esq", "sc",
            "d", "") # Everything you want to remove that isn't a separator above goes here

test.clear <- sapply(test.split, function(st) paste(st[!(st %in% titles)], collapse=" "), USE.NAMES=FALSE)
test.clear
[1] "pan-chr"            "true"               "  panjii"           "pans "             
 [5] "trs "               "iipan"              "a c clark"          "a c  joness"       
 [9] "a c jones "         "a c  huffman"       "a c  markkula"      "a c  goldtrap"     
[13] " a c cjdr  "        "false petty"        "abe  esquibel"      "reginald r  garcia"
[17] "laurence curry"     "lawrence mcdonald"  "mdonald   sebelmd " "van der walls"     

To optimize, you can use the stringi package for the split:

library(stringi)
test.split <- stri_split(test, regex="\\s|\\.|\\,|\\(|\\)")

Performance:

> system.time(replicate(10000, strsplit(test, "\\s|\\.|\\,|\\(|\\)"))) #base
     user   system  elapsed
     1.99     0.00     2.01 
> system.time(replicate(10000, str_split(test, "\\s|\\.|\\,|\\(|\\)"))) #package stringr
     user   system  elapsed
    21.97     0.03    25.39 
> system.time(replicate(10000, stri_split(test, regex="\\s|\\.|\\,|\\(|\\)"))) #package stringi
     user   system  elapsed
     0.78     0.00     0.78 

I wouldn't use any of the packages for paste() though, since base is faster:

> system.time(replicate(50000, paste(letters[1:5])))
     user   system  elapsed 
     0.28     0.00     0.28 
> system.time(replicate(50000, str_join(letters[1:5])))
     user   system  elapsed 
     1.72     0.00     1.75 
> system.time(replicate(50000, stri_join(letters[1:5])))
     user   system  elapsed 
     0.38     0.00     0.39 
  • Thank you for your solution. Based on efficiency, this is the resulting solution to match **testresult** (laid out as a full script below): `test.split <- stri_split(test, regex = "\\s|\\.|\\,|\\(|\\)")` `titles <- c("mister", "sir", "mr", "madam", "mrs", "miss", "ms", "iv", "iii", "ii", "jr", "sr", "md", "phd", "and", "&", "prof", "professor", "esquire", "esq", "dr", "doctor")` `test.clear <- sapply(test.split, function(st) paste(st[!(st %in% titles)], collapse = " "), USE.NAMES = FALSE)` `test.clear <- gsub(",", "", test.clear, fixed = TRUE)` `test.clear <- stri_trim_both(test.clear, pattern = "\\P{Wspace}")` – iembry Apr 16 '15 at 22:46
  • `test.clear <- stri_replace_all_regex(test.clear, " {2,100}", " ")` # used http://stackoverflow.com/questions/24244709/how-to-convert-data-from-pdf-files-into-data-frames-in-r/24253360#24253360 for the spaces greater than 2 – iembry Apr 16 '15 at 22:46
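
For reference, the pipeline from the two comments above, laid out as a runnable script. It assumes the stringi package is installed and loaded; the titles vector and the calls are exactly the ones given in those comments.

library(stringi)

# split on spaces, periods, commas and parentheses
test.split <- stri_split(test, regex = "\\s|\\.|\\,|\\(|\\)")

# titles and suffixes to drop
titles <- c("mister", "sir", "mr", "madam", "mrs", "miss", "ms", "iv", "iii",
            "ii", "jr", "sr", "md", "phd", "and", "&", "prof", "professor",
            "esquire", "esq", "dr", "doctor")

# drop the title tokens and paste the remaining words back together
test.clear <- sapply(test.split,
                     function(st) paste(st[!(st %in% titles)], collapse = " "),
                     USE.NAMES = FALSE)

# remove stray commas, trim whitespace, and collapse runs of spaces
test.clear <- gsub(",", "", test.clear, fixed = TRUE)
test.clear <- stri_trim_both(test.clear, pattern = "\\P{Wspace}")
test.clear <- stri_replace_all_regex(test.clear, " {2,100}", " ")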