I have a sizeable data set that includes several hundred company names and looks something like this:

Name:
Earth Ltd.
Rocket International LLC
Space Corp LLC
Space Corporation LLc
Space International Corporation Ltd
Satellite Global

Some entries are just different spellings (sometimes misspellings or renamings) of what is, for my purposes, the same company. I am trying to collapse these variants into one consistent version, e.g. Space Corp LLC, Space Corporation LLc, and Space International Corporation Ltd into Space Corp. LLC.

Is there a script or package that lets me extract syntactically or otherwise similar entries, so I can see which entries I need to collapse?

Thanks a lot!

questionmark

1 Answer

Does this work?

library(stringr)
library(dplyr)

# Example data: the company names from the question
corp <- data.frame(name = c('Earth Ltd.', 'Rocket International LLC', 'Space Corp LLC', 'Space Corporation LLc', 'Space International Corporation Ltd', 'Satellite Global'))
corp
                                 name
1                          Earth Ltd.
2            Rocket International LLC
3                      Space Corp LLC
4               Space Corporation LLc
5 Space International Corporation Ltd
6                    Satellite Global
 
# Replace each known variant with the canonical spelling; all other names are left unchanged
corp %>% mutate(newcol = str_replace_all(name, 'Space Corp LLC|Space Corporation LLc|Space International Corporation Ltd', 'Space Corp. LLC'))
                                 name                   newcol
1                          Earth Ltd.               Earth Ltd.
2            Rocket International LLC Rocket International LLC
3                      Space Corp LLC          Space Corp. LLC
4               Space Corporation LLc          Space Corp. LLC
5 Space International Corporation Ltd          Space Corp. LLC
6                    Satellite Global         Satellite Global
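
The replacement above only works once you already know which spellings belong together. To find candidate groups in the first place, one option is approximate string matching with base R's adist() (Levenshtein edit distance). Below is a minimal sketch, not a definitive solution: the normalise() helper, the list of legal-form suffixes it strips, and the 0.5 cutoff are all my own assumptions and will need tuning on real data.

```r
# Company names from the question.
names_raw <- c("Earth Ltd.", "Rocket International LLC", "Space Corp LLC",
               "Space Corporation LLc", "Space International Corporation Ltd",
               "Satellite Global")

# Hypothetical helper: lower-case, drop punctuation and a few legal-form
# suffixes (assumed list), then squeeze whitespace.
normalise <- function(x) {
  x <- tolower(x)
  x <- gsub("[[:punct:]]", "", x, perl = TRUE)
  x <- gsub("\\b(llc|ltd|inc)\\b", "", x, perl = TRUE)
  trimws(gsub("\\s+", " ", x, perl = TRUE))
}
key <- normalise(names_raw)

# Pairwise edit distances, scaled by the length of the longer string
# so that long and short names are comparable (0 = identical).
d <- utils::adist(key)
d_norm <- d / outer(nchar(key), nchar(key), pmax)

# Keep each pair once (upper triangle) if it falls below the assumed cutoff.
cutoff <- 0.5
idx <- which(d_norm < cutoff & upper.tri(d_norm), arr.ind = TRUE)
candidates <- data.frame(a = names_raw[idx[, 1]],
                         b = names_raw[idx[, 2]],
                         dist = round(d_norm[idx], 2))

# Most similar pairs first; review these by hand before collapsing them.
candidates[order(candidates$dist), ]
```

The result is a list of name pairs to inspect manually, not an automatic merge. Once you have decided which variants belong together, you can feed them into a replacement like the one above. A lower cutoff gives fewer but safer suggestions; a higher one catches more variants at the cost of more false matches.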
Karthik S