merging data in R but omit duplicated entries

Question

good day i hope I will be able to get help with this specific topic, I want to merge my one file to another file only if it has that specific outlet and within that specific month but the rows should not be duplicated.

so my file 1 has the Outlet, Month, Sales while file 2 I want to merge to has the Outlet, Month, Price and Salesvolume, however File 2 can have many instances of a single store for a single month hence the end result might have way too many duplicated entries for a single month.

so If i have Outlet 2453 for Feb 2016 in File 1 then only one entry should represent that particular month with the extra columns received from file2.

My current code:

    BPfac <- merge(File1, File2,by=c("Outlet","Month"))
    BPfac1 = BPfac[which(!duplicated(File1$Outlet) & !duplicated(File2$Month)),]

This doesn't generate the result that i want is there a better way of doing this, I only get two rows from a possible merge of 99K Outlets. The month is coded so Month 256 might represent Feb 2017 that's how it works. I just want the end result to take the first record only when matching between the two files.

Thanks in advance!

File 1

    Month   Sales   Outlet  
    256     24      2453
    256     500     2453
    243     48      2341
    242     47      2356
    243     89      2356

File 2

     Month  Price   Salesvolume Outlet  
     256        1840    222700  2453
     256        1840    237600  2453
     243        1840    277100  2341
     242        1840    279200  2356
     243        1840    451400  2356

Expected Outcome

     Month  Price   Sales   Salesvolume Outlet
     256    1840    24      222700      2453 
     243    1840    48      277100      2341
     242    1840    47      279200      2356

See above the outcome only took the first outlet and month that matched between the two files and discarded the second outlet with the same Outlet and period, hope it makes sense now.

Please provide [some example data](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) and correct the errors in your question. You can't merge by `Outlet` if your files contain `OutletName` or the misspelled `OuletName`. — neilfws, Sep 25 '17 at 22:27
Perhaps you want to provide also your expected outcome, for me it's not clear which columns you want in the resulting dataframe. — jay.sf, Sep 26 '17 at 07:14

score 0 · Accepted Answer · answered Sep 26 '17 at 08:59

You can try this, using the data.table library:

library(data.table);

file_1 = read.table(
  header = TRUE,
  text = "
  Month   Sales   Outlet  
  256     24      2453
  256     500     2453
  243     48      2341
  242     47      2356
  243     89      2356"
);

file_2 = read.table(
  header = TRUE,
  text = "
  Month  Price   Salesvolume Outlet  
     256        1840    222700  2453
     256        1840    237600  2453
     243        1840    277100  2341
     242        1840    279200  2356
     243        1840    451400  2356
  "
);

# make files into data tables
file_1 = setDT(x = file_1);
file_2 = setDT(x = file_2);

# merge file_1 to file_2 by "Outlet"
outcome = file_1[file_2, on = "Outlet"];
outcome;

   Month Sales Outlet i.Month Price
1:   256    24   2453     256  1840
2:   256   500   2453     256  1840
3:   256    24   2453     256  1840
4:   256   500   2453     256  1840
5:   243    48   2341     243  1840
6:   242    47   2356     242  1840
7:   243    89   2356     242  1840
8:   242    47   2356     243  1840
9:   243    89   2356     243  1840
   Salesvolume
1:      222700
2:      222700
3:      237600
4:      237600
5:      277100
6:      279200
7:      279200
8:      451400
9:      451400

# then, group this table by "Outlet", and then select only the first element in 
# each group. This is the given by .SD[1] and by = "Outlet", or for each data
# table created in each group, select only the first element. .SD refers to the
# Subsets of Data for each group, excluding "Outlet" (the key), and the [1] is
# an indexing operation to get the first element.
outcome = outcome[, .SD[1], by = "Outlet"]
outcome;

   Outlet Month Sales i.Month Price
1:   2453   256    24     256  1840
2:   2341   243    48     243  1840
3:   2356   242    47     242  1840
   Salesvolume
1:      222700
2:      277100
3:      279200

I hope this can help!

This works perfectly Joshua Daly, I'm so happy so I didn't need to group by month as well after all!!! — user1002713, Sep 26 '17 at 12:07
No worries! The `by` command or its equivalent is very useful. — Joshua Daly, Sep 26 '17 at 20:04

merging data in R but omit duplicated entries

1 Answers1