I am working on a large file (over 2 million rows) and would like to remove all titles and suffixes (personal and/or professional) from each string. As the small test case below shows, the titles and suffixes can appear at different positions within each string.
I have used parts of answers from the following 3 questions:
Negative lookahead on Regex Pattern
regular expression for exact match of a word
How to search for multiple strings and replace them with nothing within a list of strings
test <- c("pan-chr ii", "true ii.", "mr. and mrs panjii", "pans iv prof",
"md trs iv.", "iipan", "a c iii miss clark", "a c iv jones mrs",
"a c jones iv", "a c jr huffman phd.", "a c jr markkula",
"a c sr. goldtrap", "mr & mrs prof dr. a c cjdr iv, esq.",
"false mr petty phd", "abe jr esquibel phd",
"md reginald r dr esquire garcia", "laurence curry, md",
"lawrence mcdonald md phd", "mdonald mr and mrs sebelmd dr jr md phd",
"(van) der walls")
# test
# [1] "pan-chr ii"
# [2] "true ii."
# [3] "mr. and mrs panjii"
# [4] "pans iv prof"
# [5] "md trs iv."
# [6] "iipan"
# [7] "a c iii miss clark"
# [8] "a c iv jones mrs"
# [9] "a c jones iv"
# [10] "a c jr huffman phd."
# [11] "a c jr markkula"
# [12] "a c sr. goldtrap"
# [13] "mr & mrs prof dr. a c cjdr iv, esq."
# [14] "false mr petty phd"
# [15] "abe jr esquibel phd"
# [16] "md reginald r dr esquire garcia"
# [17] "laurence curry, md"
# [18] "lawrence mcdonald md phd"
# [19] "mdonald mr and mrs sebelmd dr jr md phd"
# [20] "(van) der walls"
# Build the pattern with paste0() so that the line breaks in the source
# do not end up inside the regular expression itself.
pattern <- paste0(
  ",? *(mister|sir|madam|mr\\.|mr|mrs\\.|mrs|ms\\.|",
  "mr\\. and mrs\\.|mr and mrs|mr\\. and mrs|mr and mrs\\.|",
  "mr\\. & mrs\\.|mr & mrs|mr\\. & mrs|mr & mrs\\.|& mrs\\.|and mrs\\.|",
  "and mrs\\.|& mrs|and mrs|ms|miss\\.|miss|prof\\.|prof|professor|",
  "doctor|md|md\\.|m\\.d\\.|dr\\.|dr|phd|phd\\.|esq\\.|esq|esquire|",
  "i{2,3}|i{2,3}\\.|iv|iv\\.|jr|jr\\.|sr|sr\\.|\\(|\\))(?![\\w\\d])"
)
testresult <- gsub(pattern, "", test, perl = TRUE)
# testresult
# [1] "pan-chr" "true."
# [3] " panj" "pans"
# [5] " trs." "iipan"
# [7] "a c clark" "a c jones"
# [9] "a c jones" "a c huffman."
# [11] "a c markkula" "a c. goldtrap"
# [13] " a c cj" "false petty"
# [15] "abe esquibel" " reginald r garcia"
# [17] "laurence curry" "lawrence mcdonald"
# [19] "mdonald sebel" "(van der walls"
1) How should the regular expression used to create testresult be revised to produce the desired result shown below?
2) Is there a faster option than gsub, given that the file has more than 2 million rows?
Thank you.
# testresult that I want to have
# [1] "pan-chr" "true"
# [3] "panjii" "pans"
# [5] "trs" "iipan"
# [7] "a c clark" "a c jones"
# [9] "a c jones" "a c huffman"
# [11] "a c markkula" "a c goldtrap"
# [13] "a c cjdr" "false petty"
# [15] "abe esquibel" "reginald r garcia"
# [17] "laurence curry" "lawrence mcdonald"
# [19] "mdonald sebelmd" "van der walls"
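For reference, here is a rough sketch of the direction I am considering, not a confirmed solution. It assumes that anchoring each title with `\\b` on both sides is enough to protect names such as "iipan", "panjii" and "sebelmd", that "and" or "&" should be dropped only when a title follows directly, and that commas and parentheses can always be removed. The helper name `strip_titles` and the short vector `x` are made up for illustration. On the speed side, the sketch assumes the real data contains many repeated strings, so the regex is run on the unique values only and mapped back with `match()`.

```r
# Hypothetical helper: strips titles/suffixes using word boundaries.
strip_titles <- function(x) {
  titles <- c("mister", "sir", "madam", "mrs", "mr", "ms", "miss",
              "professor", "prof", "doctor", "md", "m\\.d", "dr", "phd",
              "esquire", "esq", "i{2,3}", "iv", "jr", "sr")
  # Optional "and " / "& " prefix, then a whole-word title, then an optional period.
  pattern <- paste0("(?:(?:\\band|&)\\s+)?\\b(?:",
                    paste(titles, collapse = "|"), ")\\b\\.?")
  x <- gsub(pattern, "", x, perl = TRUE)
  # Assumption: commas and parentheses are never part of a name.
  x <- gsub("[(),]", "", x)
  # Collapse runs of spaces left behind, then trim both ends.
  trimws(gsub(" +", " ", x))
}

# Illustrative input with one duplicated value.
x <- c("mr. and mrs panjii", "a c jr huffman phd.", "laurence curry, md",
       "(van) der walls", "laurence curry, md")

# Clean each distinct string once, then map the results back to every row.
u <- unique(x)
cleaned <- strip_titles(u)[match(x, u)]
cleaned
# [1] "panjii"         "a c huffman"    "laurence curry" "van der walls"
# [5] "laurence curry"
```

Whether the `unique()` step helps depends on how many duplicates the 2 million rows actually contain; if nearly every string is distinct, it only adds overhead.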