0

I'm recently dealing with a project which need to combine two data sets by company names. However, company names in one data set(say A) is only abbreviated form of their full in another (say B), for example "A T T" or "AM TEL & TEL "in data set A, "AMERICAN TELEPHONE & TELEG CO" in B.

my first try is to break the name in both data sets by white spaces and take the first letter of each broke pieces then match them but failed for not found way to break string by white spaces.

I also tried grepl and grep, but it only worked for string without white space and the pattern must be given.

may be this could be done use some regular exp technique but I still didn't find a way to complete this until I write this post.

Could this task be done by R? if yes, how? below is some data from my data sets.

structure(list(abbreviated = structure(c(1L, 2L, 3L, 4L, 5L, 
6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 
19L, 20L, 21L, 22L, 23L, 24L, 25L, 26L, 26L, 27L, 27L, 28L, 29L, 
30L, 31L, 32L, 49L, 60L, 51L, 52L, 33L, 34L, 35L, 36L, 37L, 38L, 
39L, 40L, 41L, 42L, 43L, 44L, 45L, 46L, 47L, 48L, 50L, 53L, 54L, 
56L, 57L, 58L, 55L, 59L, 61L), .Label = c("20 20 SPORT", "20TH CENTRY", 
"20th Century Fox", "20TH CENTY", "21ST CENTY TELECOM GROUP INC", 
"238 Telecom Limited", "24 7", "24 7 MEDIA INC", "24 7 Real Media Inc", 
"247Media Inc", "360 COMMUN", "360NETWORKS INC", "3C COMM INTL", 
"3COM", "3Com Corp", "3COM Corp", "3COM CORP", "3D COMMUN", "3D Industrial Electronics PTE", 
"3Dfx", "3m", "3M", "3M Co", "3M CO", "3M Corporation", "3M Unitek", 
"3M UNITEK", "3SBio Inc", "7 Eleven  Inc ", "7 Eleven Inc", "7 ELEVEN INC", 
"A   C WHSL", "A 1 International Inc", "A 1 INTL INC", "A 1 LEASING", 
"A 2 Z STORES", "A A FOODS", "A A R P", "A B DICK", "A B DRACO", 
"A C COIN SLOT", "A D", "A D  KRAUTH", "A D I", "a e", "A E TELEVISION NTWK", 
"A G A BURDOX", "A G I P S P A", "A G INC", "A H ROBINS", "A Lassonde Inc ", 
"A P", "A S Dampskibsselskabet Torm", "A STURM   SON", "A T Clayton   Co", 
"A T T", "A T T Corp", "A T T CORP", "A T T TECH", "A W RESTRNT", 
"A123 Systems"), class = "factor"), full = structure(c(1L, 2L, 
3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 16L, 
17L, 18L, 19L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
), .Label = c("20TH CENTURY ENERGY CORP", "20TH CENTURY INDUSTRIES", 
"20TH CENTURY INDUSTRIES CA", "21ST CENTURY DISTRIBUTION CORP", 
"21ST CENTURY FILMS CORP", "21ST CENTURY HOLDING CO", "21ST CENTURY INSURANCE GROUP", 
"21ST CENTURY ROBOTICS", "24 7 MEDIA INC", "24 7 REAL MEDIA INC", 
"360 COMMUNICATIONS CO", "360NETWORKS INC", "3COM CORP", "3DFX INTERACTIVE INC", 
"3M CO", "3SBIO INC", "7 ELEVEN INC", "A   A FOODS LTD", "ROBINS A H INC"
), class = "factor")), .Names = c("abbreviated", "full"), row.names = c(NA, 
63L), class = "data.frame")

Any suggestions would be deeply appreciated. Thanks in advance.

Jia Gao
  • 1,172
  • 3
  • 13
  • 26
  • Look into the `abbreviate` function: `?abbreviate`. – Abdou Feb 23 '17 at 03:21
  • 1
    What you've got is essentially a probabilistic matching exercise. You might want to look into comparing the strings using a string similarity measure. For instance, the `?adist` function or the `stringdist` package. – thelatemail Feb 23 '17 at 03:29
  • I tried this but since it need to specify the minlength and there are too many of the names that I can't predetermine this parameter ,say `abbreviate("A T T")` we get :A T T , ` abbreviate("AM TEL & TEL")`, we get : "AT&T" , or `abbreviate("AM TEL TEL")`, we get "AMTT". all of the three actually are the same thing but abbreviate give different result. – Jia Gao Feb 23 '17 at 03:31
  • 1
    You may have to do some [fuzzy/stringdist](https://github.com/markvanderloo/stringdist) matching here. – Abdou Feb 23 '17 at 03:31
  • @JasonGoal - removing any non-A-Z and 0-9 values and removing or standardising short words like AND / & first may make your task a bit easier too. – thelatemail Feb 23 '17 at 03:33
  • thanks thelatemail , yeah that 's a good way . and for the suggestion you made in the previous comment, that's a way to do it but the matching set will be small (for example we sue method== "Soundex") directly on the original names ) , it's better to perform it on the abbreviated form like "A T T" not "AM TEL & TEL", so it's back to the ancient question about how to abbreviate them first. R should have the ability to separate string into sub strings by white space , I just didn't find it yet, any suggestions? – Jia Gao Feb 23 '17 at 03:48
  • @ Abdou, yeah that's a workable way, but the data set clearly claimed that the names in one data set are exactly the abbreviation of names in another data set, which means all letters appeared in the abbreviated form should appear in the full form. I postulate that match on abbreviation result in higher accuracy and larger sample size . Which is definitely a good thing – Jia Gao Feb 23 '17 at 03:53
  • @JasonGoal - you could use some incantation of `strsplit` - like `x <- c("A T T", "AM TEL & TEL", "AM TEL TEL"); sapply(strsplit(gsub("[^[:alnum:] ]", "", x), "\\s+"), function(x) paste(substr(x, 1, 1),collapse="") )` for instance. There's probably a neat regex way too but I can't get it to work. – thelatemail Feb 23 '17 at 04:29
  • great idea, I'll try best to make it work – Jia Gao Feb 23 '17 at 04:55
  • Maybe duplicate of: http://stackoverflow.com/q/17017024/2372064 or any of the other "fuzzy matching" questions on stack overflow. This is not a simple problem. – MrFlick Feb 23 '17 at 05:33
  • yeah ,it's a similar question, but fuzzy matching is not what I want since I know exactly that one is the abbreviation of another , trying to work out another solution, I'll post it here if it works good. – Jia Gao Feb 24 '17 at 02:27

0 Answers0