I have a column of company names, `db_name$name`.
I've found the 100 most common endings (Inc, Ltd, GmbH, Co, etc.) and concatenated them into a single string, separated by `|`, to make them easier to use in regular expressions.
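`library(dplyr)    # %>%, arrange(), filter(), row_number()
library(stringr)  # word(), str_c()
db_name$ending <- word(db_name$name, -1)               # last word of each name
db_end_count <- data.frame(table(db_name$ending)) %>%  # frequency of each ending
  arrange(desc(Freq)) %>%
  filter(row_number() <= 100)                          # keep the 100 most common
db_end <- str_c(db_end_count$Var1, collapse = "|")`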
I'd like to remove these common endings from the end of each string, without removing them from inside other words (so 'Communications Co' should not become 'mmunications '), while also keeping company names that consist of only one word.
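To make the truncation problem reproducible, here is a minimal base R example. The four-ending pattern is a hypothetical stand-in for the real `db_end`, which has 100 endings.

```r
# Hypothetical short stand-in for db_end (the real pattern has 100 endings)
endings <- "Inc|Ltd|GmbH|Co"

# An unanchored alternation matches "Co" anywhere in the string,
# including at the start of "Communications"
truncated <- gsub(endings, "", "Communications Co")
truncated
#> [1] "mmunications "
```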
The solution I've been experimenting with is derived from the question "R remove last word from string", which basically suggests `gsub("\\s*\\w*$", "", db_name$name)`, except that I've been replacing `\\w` with my pattern of the 100 most common endings, using the rebus package. However, every variant I try (with or without the `*` or the `\\s`) results in one of the issues I described above (truncated words, or whole words being dropped).
Could someone suggest a way to remove the most common company endings from the end of the company name strings, either along the lines of what I've done so far or with something even more clever? Thanks!
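For reference, here is a sketch of the behaviour I'm after, which may or may not be the best way to get it. It groups the alternation, requires at least one whitespace character before it, and anchors it to the end of the string with `$`. The sample names and the short `endings` vector are hypothetical stand-ins for `db_name$name` and the real 100 endings. It also assumes the endings contain no regex metacharacters (an ending such as `Inc.` would need escaping first).

```r
# Hypothetical sample data standing in for db_name$name
company_names <- c("Acme Communications Co", "Siemens GmbH", "Apple", "Costco Inc")

# Hypothetical short stand-in for the 100 most common endings
endings <- c("Inc", "Ltd", "GmbH", "Co")

# Group the alternation, require whitespace before it, and anchor it with `$`
# so that only a trailing, separate word can match
pattern <- paste0("\\s+(", paste(endings, collapse = "|"), ")$")

# sub() replaces at most one match per string; here that is the trailing ending
cleaned <- sub(pattern, "", company_names)
cleaned
# Resulting values: "Acme Communications", "Siemens", "Apple", "Costco"
```

Because the pattern requires whitespace before the ending, a one-word name is left untouched even if it is itself one of the endings. The `Co` inside `Communications` and `Costco` is not matched, since it is neither preceded by whitespace nor at the end of the string.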