I'm currently working on a project that needs to combine two data sets by company name. However, the company names in one data set (say A) are only abbreviated forms of the full names in the other (say B). For example, "A T T" or "AM TEL & TEL" in data set A corresponds to "AMERICAN TELEPHONE & TELEG CO" in B.
My first attempt was to split the names in both data sets on white space, take the first letter of each piece, and then match on those letters, but it failed because I could not find a way to split a string on white space.
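To make the first-letter idea concrete, here is a minimal sketch of what it might look like. It assumes base R's `strsplit()` with the regex `\\s+` is a suitable way to split on white space, and that pieces with no letters or digits (such as "&") should be skipped. `initials()` is a hypothetical helper name, not something from my data or from any package.

```r
# Hypothetical helper: first character of each whitespace-separated piece.
initials <- function(x) {
  x <- toupper(trimws(as.character(x)))   # as.character() also handles factors
  pieces <- strsplit(x, "\\s+")           # split on runs of white space
  vapply(pieces, function(p) {
    p <- p[grepl("[[:alnum:]]", p)]       # assumption: skip pieces such as "&"
    paste(substr(p, 1, 1), collapse = "")
  }, character(1), USE.NAMES = FALSE)
}

initials(c("A T T", "AM TEL & TEL ", "AMERICAN TELEPHONE & TELEG CO"))
# "ATT" "ATT" "ATTC"
```

Note that the initials of the full name come out longer than those of the abbreviation ("ATTC" against "ATT"), so an exact comparison would still miss this pair. Something like `startsWith("ATTC", "ATT")` might be needed instead, and very short initials will probably produce false matches.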
I also tried grepl and grep, but they only worked for strings without white space, and the pattern has to be given explicitly.
Maybe this could be done with some regular expression technique, but I had not found a way to do it by the time I wrote this post.
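One regular expression approach that might fit is to build the pattern from each abbreviation automatically, so that every piece of the abbreviation has to be the beginning of a word in the full name, in the same order. The sketch below is only an illustration under that assumption. `abbr_to_regex()` is a hypothetical helper, the three full names are taken from the sample data, and the pieces are assumed to contain no regex metacharacters.

```r
# Hypothetical helper: each piece of the abbreviation must start a word
# of the full name, in the same order, beginning at the first word.
abbr_to_regex <- function(abbr) {
  pieces <- strsplit(toupper(trimws(abbr)), "\\s+")[[1]]
  pieces <- pieces[grepl("[[:alnum:]]", pieces)]   # assumption: ignore "&" etc.
  # piece + rest of the word, then at least one non-alphanumeric separator
  paste0("^", paste0(pieces, "[[:alnum:]]*", collapse = "[^[:alnum:]]+"))
}

full <- c("AMERICAN TELEPHONE & TELEG CO", "3M CO", "ROBINS A H INC")

grepl(abbr_to_regex("AM TEL & TEL "), full)
# TRUE FALSE FALSE
grepl(abbr_to_regex("A T T"), full)
# TRUE FALSE FALSE
```

This would not catch names whose word order differs between the two data sets (for example "A H ROBINS" against "ROBINS A H INC"), so it is at best a partial solution.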
Can this task be done in R? If yes, how? Below is some sample data from my data sets.
structure(list(abbreviated = structure(c(1L, 2L, 3L, 4L, 5L,
6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L,
19L, 20L, 21L, 22L, 23L, 24L, 25L, 26L, 26L, 27L, 27L, 28L, 29L,
30L, 31L, 32L, 49L, 60L, 51L, 52L, 33L, 34L, 35L, 36L, 37L, 38L,
39L, 40L, 41L, 42L, 43L, 44L, 45L, 46L, 47L, 48L, 50L, 53L, 54L,
56L, 57L, 58L, 55L, 59L, 61L), .Label = c("20 20 SPORT", "20TH CENTRY",
"20th Century Fox", "20TH CENTY", "21ST CENTY TELECOM GROUP INC",
"238 Telecom Limited", "24 7", "24 7 MEDIA INC", "24 7 Real Media Inc",
"247Media Inc", "360 COMMUN", "360NETWORKS INC", "3C COMM INTL",
"3COM", "3Com Corp", "3COM Corp", "3COM CORP", "3D COMMUN", "3D Industrial Electronics PTE",
"3Dfx", "3m", "3M", "3M Co", "3M CO", "3M Corporation", "3M Unitek",
"3M UNITEK", "3SBio Inc", "7 Eleven Inc ", "7 Eleven Inc", "7 ELEVEN INC",
"A C WHSL", "A 1 International Inc", "A 1 INTL INC", "A 1 LEASING",
"A 2 Z STORES", "A A FOODS", "A A R P", "A B DICK", "A B DRACO",
"A C COIN SLOT", "A D", "A D KRAUTH", "A D I", "a e", "A E TELEVISION NTWK",
"A G A BURDOX", "A G I P S P A", "A G INC", "A H ROBINS", "A Lassonde Inc ",
"A P", "A S Dampskibsselskabet Torm", "A STURM SON", "A T Clayton Co",
"A T T", "A T T Corp", "A T T CORP", "A T T TECH", "A W RESTRNT",
"A123 Systems"), class = "factor"), full = structure(c(1L, 2L,
3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 16L,
17L, 18L, 19L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
), .Label = c("20TH CENTURY ENERGY CORP", "20TH CENTURY INDUSTRIES",
"20TH CENTURY INDUSTRIES CA", "21ST CENTURY DISTRIBUTION CORP",
"21ST CENTURY FILMS CORP", "21ST CENTURY HOLDING CO", "21ST CENTURY INSURANCE GROUP",
"21ST CENTURY ROBOTICS", "24 7 MEDIA INC", "24 7 REAL MEDIA INC",
"360 COMMUNICATIONS CO", "360NETWORKS INC", "3COM CORP", "3DFX INTERACTIVE INC",
"3M CO", "3SBIO INC", "7 ELEVEN INC", "A A FOODS LTD", "ROBINS A H INC"
), class = "factor")), .Names = c("abbreviated", "full"), row.names = c(NA,
63L), class = "data.frame")
Any suggestions would be deeply appreciated. Thanks in advance.