0

I have been working on large data set which has names of customers , each of this has to be checked with the master file which has correct names (300 KB) and if matched append the master file name to names of customer file as new column value. My prev Question worked for small data sets

Both Customer & Master file has been cleaned using tm and have tried different logic , but only works on small set of data when applied to huge files not effective, pattern matching doesn't help here my opinion cause no names comes with exact pattern

Cus File

1           chang chun petrochemical  
2                chang chun plastics  
3                     church  dwight  
4        citrix systems asia pacific  
5          cnh industrial services srl
6                   conoco phillips   
7                    conocophillips   
8                  dfk laurence varnay
9                       dtz worldwide 
10  electro motive maintenance operati
11                enterasys networks  
12                   esso  resources  
13                          expedia   
14                            expedia 
15        exponential interactive aust
16        exxonmobil asia pacific pte 
17    exxonmobil chemical asia pac div
18                     exxonmobil png 
19         formula  world championship
20      fortitech asia pacific sdn bhd

Master

1                     chang chun group
2                     church  dwight  
3        citrix systems asia pacific  
4                    cnh industrial nv
5                      conoco phillips
6                  dfk laurence varnay
7                  dtz group  zealand 
8                         caterpillar 
9                 enterasys networks  
10                   exxon mobil group
11                       expedia group
12        exponential interactive aust
13         formula  world championship
14      fortitech asia pacific sdn bhd
15                frhi hotels  resorts
16          gardner denver industries 
17  glencore xstrata international plc
18                            grace   
19                       incomm   nz  
20              information resources 
21                    kbr holdings llc
22                       kennametal   
23                            komatsu 
24     leonhard hofstetter pelzdesign 
25          communications corporation
26              manhattan associates  
27                             mattel 
28                        mmg finance 
29                     nokia oyj group
30                           nortek  

i have tried with this simple loop

for (i in 1:100){
  result$x[i] = agrep(result$ICIS_Cust_Names[i], result1$Master_Names, value = TRUE, max = list(del = 0.2, ins = 0.3, sub = 0.4))
  #result$Y[i] = agrep(result$ICIS_Cust_Names[i], result1$Master_Names, value = FALSE, max = list(del = 0.2, ins = 0.3, sub = 0.4))
}

*result *

1           chang chun petrochemical                             <NA> NA
2                chang chun plastics                             <NA> NA
3                     church  dwight                 church  dwight    2
4        citrix systems asia pacific    citrix systems asia pacific    3
5          cnh industrial services srl                           <NA> NA
6                   conoco phillips                  church  dwight    2
7                    conocophillips                              <NA> NA
8                  dfk laurence varnay                           <NA> NA
9                       dtz worldwide                church  dwight    2
10  electro motive maintenance operati                           <NA> NA
11                enterasys networks                             <NA> NA
12                   esso  resources                 church  dwight    2
13                          expedia                              <NA> NA
14                            expedia                            <NA> NA
15        exponential interactive aust               church  dwight    2
16        exxonmobil asia pacific pte                            <NA> NA
17    exxonmobil chemical asia pac div                           <NA> NA
18                     exxonmobil png                church  dwight    2
19         formula  world championship                           <NA> NA
20      fortitech asia pacific sdn bhd 

tried with lapply but no use , as you can notice my master file is large and some times i get error of rows length doesn't match!

mm<-dt[lapply(result, function(x) levenshteinDist(x ,lapply(result1, function(x) x)))]

#using looping stat. for checking each cus name with all the master names
for(i in seq(nrow(result)) )
    {
      if((levenshteindist(result[i],lapply(result1, function(x) String(x))))==0)
        sprintf("%s", x)
    }

which method would be best for this ? similar to my Q but not much helpfullI referd few Q from STO

it might be naive but when applied with huge data sets it mis behaves, can anybody familiar with R could correct me with the above code for levenshteinDist

code:

 #check with each value of master file and if matches more than .90 then return master value.


for(i in seq(1:nrow(gr1))
{
  for(j in seq(1:nrow(gr2))
  {
     gr1$jar[i,j]<-jarowinkler(gr1$ICIS_Cust_Names[i],gr2$Master_Names[j])
     if(gr1$jar[i,j]>.90)
         gr1$res[i] = gr2$Master_Names[j] 

  }
}
#Please let know if there is any minute error with this code

Please if anybody has worked with such data in R please help !

Community
  • 1
  • 1
KRU
  • 291
  • 4
  • 18
  • Have you seen this: http://stackoverflow.com/questions/27975705/compare-strings-for-an-approximate-match/27975870#27975870 - it doesn't work perfectly either, but it is okay (16/20 correct with no tweaking whatsoever). – thelatemail Apr 10 '15 at 04:27
  • yea , but this is quite different from my question , i have tried with `adist` too but would be best i guess , if i could solve it through `levenshteinDist`. I guess my logic for above code is okay but have minute error with looping , if u can please correct me – KRU Apr 10 '15 at 04:45
  • i have applied for small data sets which is absolutely perfect for only small data , huge data sets doesn't work – KRU Apr 10 '15 at 04:46
  • `adist` by default uses a generalized Levenshtein difference, just like the function from the `RecordLinkage` package. I doubt it would give wildly different results. – thelatemail Apr 10 '15 at 05:29
  • Here you go: `master[sapply(cus, function(x) which.min(levenshteinDist(x,master)) )]` - essentially the same as the previous solution I linked you to, with the same errors. – thelatemail Apr 10 '15 at 05:40
  • `which.min` index location for min ,why is this needed , `master[sapply(cus, function(x) which.min(levenshteinDist(x,master)) )]` this is throwing error – KRU Apr 13 '15 at 03:35
  • works for me - are you using the `levenshteinDist` function from the `RecordLinkage` package? `which.min` just grabs the case with the lowest (min) distance, which should be the closest match. – thelatemail Apr 13 '15 at 06:15
  • yes , `levenshteinDist` of `RecordLinkage` package itself , each of cust row value must check for match with all the row values of master file and return mater name if it matchs – KRU Apr 13 '15 at 07:07

1 Answers1

0

achieved partial result by

code :

df$result<-data.frame(df$Cust_Names, df$Master_Names[max.col(-adist(df$Cust_Names,df$Master_Names))])
KRU
  • 291
  • 4
  • 18