Add the row values with partial match and give aggregate sum using r

Question

I have a data with three columns --> Name, Amount_A, Amount_B.

data_Activity <- structure(list(Name_Corp = c("3F INDUSTRIES LTD",
                                              "99 GAMES ONLINE PVT LTD",
                                              "A J HOSPITAL RESEARCH CENTRE",
                                              "A P COOPERATIVE OILSEEDS GROWERS FEDERATION LTD",
                                              "A2Z ONLINE SERVICES PVT LTD",
                                              "AAA PEE APPLIANCES",
                                              "AAKANKSHA MANAGEMENT CONSULTANCY & HOLDINGS PVT LTD",
                                              "AAKAR FOUNDRY PVT LTD",
                                              "AARA TECH PRIVATE LIMITED",
                                              "AARVI ENCON LIMITED",
                                              "AARVI ENCON PVT LTD GAMESA REGULAR",
                                              "AAS ALUMNI ASSOCIATION",
                                              "AAVISHKAAR VENTURE MANAGEMENT SERVICES PVT LTD",
                                              "ABAN OFFSHORE LIMITED",
                                              "ABBOTT HEALTHCARE PRIVATE LIMITED",
                                              "ABS INDIA PVT LTD",
                                              "ACCENTURE",
                                              "ACCENTURE SERVICES PVT LTD-ASOPL",
                                              "ACCENTURE SERVICES PVT LTD-IDB",
                                              "ACCENTURE SERVICES PVT LTD-ZMS",
                                              "ACCORD SOFTWARE & SYSTEMS PVT LTD",
                                              "ACE MANUFACTURING SYSTEM",
                                              "ACE MANUFACTURING SYSTEMS LIMITED",
                                              "ACE MULTI AXES SYSTEMS LTD"), 
                                AMount_A = c(794, 
                                             754,
                                             7517,
                                             4862,
                                             49395,
                                             638,
                                             0,
                                             2734,
                                             0,
                                             14471,
                                             0,
                                             164,
                                             15537,
                                             9720,
                                             5645,
                                             2046,
                                             10344299,
                                             0,
                                             0,
                                             0,
                                             44832,
                                             2107,
                                             61429,
                                             18825), 
                                AMount_B = c(5198,
                                             3800,
                                             5706,
                                             8227,
                                             28799,
                                             1147,
                                             54,
                                             2754,
                                             966,
                                             13162,
                                             1139,
                                             267,
                                             8035,
                                             10396,
                                             41251,
                                             2790,
                                             0,
                                             9586514,
                                             57029,
                                             19072,
                                             58598,
                                             7731,
                                             64743,
                                             20964)))

My "Name" column is having similar row items with different AMount_A & AMount_B values. I need to aggregate the data based on similar names:

Eg:

Accenture, Accenture Services are same. I need the Aggregated amount_A & Amount_B in R

kindly help in R ! I have used fuzzy match but no use

This question has been asked before. See https://stackoverflow.com/questions/2231993/merging-two-data-frames-using-fuzzy-approximate-string-matching-in-r — Arno, Aug 28 '18 at 06:58
It is a single table where I need the aggregate based on similar corp_names — PC_K, Aug 28 '18 at 07:32

RLave · Answer 1 · 2018-08-28T08:23:39.010

library(tidyverse)

data_Activity_sum <- data_Activity %>% 
  mutate(first_char = gsub( " .*$", "", Name_Corp)) %>% 
  group_by(first_char) %>% 
  summarise(sum_a = sum(AMount_A), sum_b = sum(AMount_B)) %>% 
  mutate(sum_a_b = sum_a + sum_b)

I used gsub to strip everything after the first white space. Then I aggregated by that, see if this works.

You can of course change the gsub to match accordingly. I can see already that you might have some problem with some cases, so maybe you will need to come up with a custom regex instead of the simple gsub.

Output:

# A tibble: 17 x 4
   first_char    sum_a   sum_b  sum_a_b
   <chr>         <dbl>   <dbl>    <dbl>
 1 3F              794    5198     5992
 2 99              754    3800     4554
 3 A             12379   13933    26312
 4 A2Z           49395   28799    78194
 5 AAA             638    1147     1785
 6 AAKANKSHA         0      54       54
 7 AAKAR          2734    2754     5488
 8 AARA              0     966      966
 9 AARVI         14471   14301    28772
10 AAS             164     267      431
11 AAVISHKAAR    15537    8035    23572
12 ABAN           9720   10396    20116
13 ABBOTT         5645   41251    46896
14 ABS            2046    2790     4836
15 ACCENTURE  10344299 9662615 20006914
16 ACCORD        44832   58598   103430
17 ACE           82361   93438   175799

As other already suggested you should have looked at Merging two Data Frames using Fuzzy/Approximate String Matching in R.

You need to create a second dataframe that has the "simplified" names, and merge with stringdist_join

library(fuzzyjoin)

data_names <- tibble(
  Name_Corp = c("3F INDUSTRIES", "99 GAMES", "A J HOSPITAL")
) # simplified data names (only 3, you need to pick every case

data_Activity_sub <- data_Activity[1:5, ] # subset of your data

stringdist_join(data_Activity_sub, data_names, by = "Name_Corp", method = "soundex")

# A tibble: 3 x 4
  Name_Corp.x                  AMount_A AMount_B Name_Corp.y  
  <chr>                           <dbl>    <dbl> <chr>        
1 3F INDUSTRIES LTD                 794     5198 3F INDUSTRIES
2 99 GAMES ONLINE PVT LTD           754     3800 99 GAMES     
3 A J HOSPITAL RESEARCH CENTRE     7517     5706 A J HOSPITAL

From here use *group.by* on Name_Corp.y.

A J HOSPITAL RESEARCH CENTRE", "A P COOPERATIVE OILSEEDS GROWERS FEDERATION LTD" are different entities. Based on first whitespace separation, these two gets clubbed as "A" which is not correct! — PC_K, Aug 28 '18 at 07:48
As I stated my solution doesn't cover all options. If you have just these names I suggest to strip the first white space from these two (you'll get AJ and AP). — RLave, Aug 28 '18 at 07:59

Add the row values with partial match and give aggregate sum using r

1 Answers1