
I have two data frames that have been cleaned and merged into a single CSV file. The data frames look like this:

  **Source                         Master**

 chang chun petrochemical      CHANG CHUN GROUP
 chang chun plastics           CHURCH AND DWIGHT CO INC
 church  dwight                CITRIX SYSTEMS ASIA PACIFIC P L
 citrix systems  pacific       CNH INDUSTRIAL N.V

From these, I need to take the first Source name, compare it with every Master name, and find the relevant match, then print the output as another data frame. Only a few rows are shown above; I am working with 20k such values.

My output must look like this:

 **Source                         Master                         Result**

 chang chun petrochemical      CHANG CHUN GROUP                 CHANG CHUN GROUP
 chang chun plastics           CHURCH AND DWIGHT CO INC         CHANG CHUN GROUP
 church  dwight                CITRIX SYSTEMS ASIA PACIFIC P L  CHURCH AND DWIGHT CO INC
 citrix systems  pacific       CNH INDUSTRIAL N.V               CITRIX SYSTEMS ASIA PACIFIC P L

I tried several approaches from the linked question "Merging through fuzzy matching of variables in R", but no luck so far.

Thanks in advance!

When I use the code below on a large data set, the result is this:

code used:

Mast <- pmatch(Names$I_sender_O_Receiver_Customer, Master.Names$MOD, nomatch=NA)

OUTPUT

NA NA  2  3 NA NA NA  6 NA NA  9 NA NA NA 12 NA NA NA 13 14 15 16 NA 18 19 20 21 22 NA 24 NA 26 NA 28 NA NA NA 30 NA NA 33 NA 35 36 37 NA 39 40 NA NA 43 NA 45 46 NA 48 49 50 51 52 53 54 55 56 57 58 NA
 [68] 60 61 62 NA NA NA NA 64 NA 66 67 68 69 70 71 72 73 NA 75 76 77 78 NA 79 80 81 NA 83 84 85 86 87 88

CODE:

Mast <- sapply(Names$I_sender_O_Receiver_Customer, function(x) {
   agrep(x, Master.Names$MOD,value=TRUE) })

OUTPUT:

[[1]]
character(0)

[[2]]
character(0)

[[3]]
[1] " CHURCH AND DWIGHT CO INC"

[[4]]
[1] " CITRIX SYSTEMS ASIA PACIFIC P L"

[[5]]
character(0)

Even with a for loop, no result is produced.

code:

for(i in seq_len(nrow(df$ICIS_Cust_Names)))
  {
    df$reslt[i] <- grep(x = str_split(df$ICIS_Cust_Names[i]," ")[[1]][1], df$Master_Names[i],value=TRUE)
  }
  print(df$reslt)

Code 2: Used for loop just for 100 rows

for (i in 100){
  gr1$x[i] = agrep(gr1$ICIS_Cust_Names[i], gr2$Master_Names, value = TRUE, max = list(del = 0.2, ins = 0.3, sub = 0.4))
  gr2$Y[i] = agrep(gr1$ICIS_Cust_Names[i], gr2$Master_Names, value = FALSE, max = list(del = 0.2, ins = 0.3, sub = 0.4))
}

Result:

NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA

Error

Error in `$<-.data.frame`(`*tmp*`, "x", value = c(NA, NA, " church  dwight  " : 
  replacement has 3 rows, data has 100

Looking at the result above, the code compares only the values in the same row of each data frame. What I want is for it to take the first element of Source, check it against all the elements of Master, and come up with a match, and likewise for the rest. I would appreciate it if someone could correct my code. Thanks in advance!

KRU

1 Answer

If you want to check the Master.Names only against the first word in Names, this could do the trick:

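Names$Mast <- NA
for (i in seq_len(nrow(Names))) {
    # First word of the source name, upper-cased to match the master list
    first_word <- toupper(strsplit(as.character(Names[i, 1]), " ")[[1]][1])
    Names$Mast[i] <- grep(first_word, Master.Names$V1, value = TRUE)[1]  # first hit, or NA
}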

Edit

Using sapply instead of a loop could gain you some speed:

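Names$Mast <- sapply(as.character(Names$V1), function(x) {
    # First match for the upper-cased first word, or NA when there is none
    grep(toupper(strsplit(x, " ")[[1]][1]), Master.Names$V1, value = TRUE)[1]
}, USE.NAMES = FALSE)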

Results

> Names
                        V1                            Mast
1 chang chun petrochemical                CHANG CHUN GROUP
2      chang chun plastics                CHANG CHUN GROUP
3            church dwight        CHURCH AND DWIGHT CO INC
4   citrix systems pacific CITRIX SYSTEMS ASIA PACIFIC P L

Data

Master.Names <- read.csv(text="CHANG CHUN GROUP
CHURCH AND DWIGHT CO INC
CITRIX SYSTEMS ASIA PACIFIC P L
CNH INDUSTRIAL N.V", header=FALSE, stringsAsFactors=FALSE)

Names <- read.csv(text="chang chun petrochemical
chang chun plastics
church dwight
citrix systems pacific", header=FALSE, stringsAsFactors=FALSE)
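
A given first word can match zero or several master names, and the name columns may have been read in as factors. Below is a minimal sketch of one way to handle both cases. The helper `first_word_match` and the extra "zebra holdings" row are hypothetical, added only for illustration; the rest of the data is the sample shown here.

```r
# Sample data, stored as character (not factor) columns
Master.Names <- data.frame(V1 = c("CHANG CHUN GROUP",
                                  "CHURCH AND DWIGHT CO INC",
                                  "CITRIX SYSTEMS ASIA PACIFIC P L",
                                  "CNH INDUSTRIAL N.V"),
                           stringsAsFactors = FALSE)
Names <- data.frame(V1 = c("chang chun petrochemical", "chang chun plastics",
                           "church dwight", "citrix systems pacific",
                           "zebra holdings"),  # hypothetical row with no match
                    stringsAsFactors = FALSE)

# first_word_match is a hypothetical helper, not part of any package
first_word_match <- function(src, master) {
    src <- trimws(as.character(src))        # as.character also copes with factors
    master <- trimws(as.character(master))
    vapply(src, function(x) {
        first <- toupper(strsplit(x, " +")[[1]][1])
        # fixed = TRUE treats the word literally rather than as a regex
        hits <- grep(first, master, fixed = TRUE, value = TRUE)
        # Always return exactly one value: the first hit, or NA when none
        if (length(hits) == 0) NA_character_ else hits[1]
    }, character(1), USE.NAMES = FALSE)
}

Names$Mast <- first_word_match(Names$V1, Master.Names$V1)
print(Names)
```

Because `vapply` insists on one character value per name, the result is always a plain character column rather than a list, and unmatched names show up as `NA` instead of raising an error.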
Dominic Comtois
  • Maybe the `Master.Names$V1` ? If so, try `Master.Names[,1]` instead. – Dominic Comtois Mar 31 '15 at 05:33
  • It's throwing an error on this: `toupper(x = strsplit(Names[i,1]," ")[[1]][1])`. – KRU Mar 31 '15 at 05:54
  • Can I use `amatch` from `stringdist` to match the above? Can I pass a data frame value so that `amatch` doesn't throw an error? – KRU Mar 31 '15 at 06:07
  • My answer assumed your dataframes were named `Names` and `Master.Names` ... is this the case? Or maybe your variables are factors, in which case you'll need to use `as.character()` to convert them to strings. – Dominic Comtois Mar 31 '15 at 06:15
  • For the sample data only, I named them cust. names and master names, but I am dealing with a huge data frame, and the logic applied doesn't work in that case. – KRU Mar 31 '15 at 06:23
  • Two things to check: `class()` will tell you whether your name variables (in the 2 dataframes) are factors or not. If so, then you need to use `as.character()` as I mentioned earlier. Next, the column index in the loop (`Names[i,1]`) was 1, assuming the name is the first column. If it's not, obviously you need to change that. Otherwise the size of the dataframe shouldn't affect anything other than speed of execution. – Dominic Comtois Mar 31 '15 at 06:29
  • Yes, exactly, speed of execution does matter in my case! – KRU Mar 31 '15 at 06:31
  • Can you try with `sapply` (see my edit) and see if there's improvement? – Dominic Comtois Mar 31 '15 at 06:47
  • `Res <- sapply(grp1$Customer_Names, function(x) { grep(toupper(x = str_split(x," ")[[1]][1]), grp2$Master_Names,value=TRUE) })` When I use this on my huge data frames, it gives `character(0)`. – KRU Mar 31 '15 at 07:09
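
The comments raise the question of comparing each source name against every master name when the spacing and case are irregular. One possible sketch, using base R only, scores each master name by the share of source words it contains. This is a plain word-overlap heuristic, not the answer's method and not `amatch`. The helpers `tokens` and `best_by_shared_words` are hypothetical names, and the values are the sample rows from the question, including their stray spaces.

```r
# Sample values from the question, including its irregular spacing
src <- c("chang chun petrochemical", "chang chun plastics",
         "church  dwight", "citrix systems  pacific")
master <- c(" CHANG CHUN GROUP", " CHURCH AND DWIGHT CO INC",
            " CITRIX SYSTEMS ASIA PACIFIC P L", " CNH INDUSTRIAL N.V")

# Hypothetical helper: trim, upper-case, and split on runs of spaces
tokens <- function(x) strsplit(toupper(trimws(as.character(x))), " +")

# Hypothetical helper: best master name by fraction of shared words
best_by_shared_words <- function(src, master) {
    master_tok <- tokens(master)
    vapply(tokens(src), function(words) {
        # Fraction of the source words that also occur in each master name
        score <- vapply(master_tok, function(m) mean(words %in% m), numeric(1))
        # NA when no master name shares any word with the source name
        if (!any(score > 0, na.rm = TRUE)) NA_character_
        else trimws(master[which.max(score)])
    }, character(1))
}

result <- data.frame(Source = src,
                     Result = best_by_shared_words(src, master),
                     stringsAsFactors = FALSE)
print(result)
```

Every source name is compared with every master name here, so on 20k rows this may be slow; narrowing the candidates first (for example, by first word) would be one way to cut the work down.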