
The ultimate goal is to sum the total quantity (transact_data$qty) for each record in product_info, over the rows of transact_data whose productId matches that record and whose date falls between that record's beg_date and end_date.

The dataframes are below:

product_info <- data.frame(productId = c("A", "B", "A", "C","C","B"), 
                      old_price = c(0.5,0.10,0.11,0.12,0.3,0.4),
                      new_price = c(0.7,0.11,0.12,0.11,0.2,0.3),
                      beg_date = c("2014-05-01", "2014-06-01", "2014-05-01", "2014-06-01","2014-05-01", "2014-06-01"),
                      end_date = c("2014-05-31", "2014-06-31", "2014-05-31", "2014-06-31","2014-05-31", "2014-06-31"), stringsAsFactors=FALSE)

transact_data <- data.frame(productId=c('A', 'B','A', 'C','A', 'B','C', 'B','A', 'C','A', 'B'),
                  date=c("2014-05-05", "2014-06-22", "2014-07-05", "2014-08-31","2014-05-03", "2014-02-22",
                    "2014-05-21", "2014-06-19", "2014-03-09", "2014-06-22","2014-04-03", "2014-07-08"),
                    qty =c(12,15,5,21,13,17,2,5,11,9,6,4), stringsAsFactors=FALSE)

My first step was to merge both dataframes by productId:

sku_transact_merge <- merge(x = product_info, y = transact_data, by = c("productId"))

The next step was to calculate the quantity sum:

sku_transact_merge$total_qty <- ifelse(sku_transact_merge$date >= sku_transact_merge$beg_date & 
                                       sku_transact_merge$date <= sku_transact_merge$end_date, 
                                     aggregate(qty ~ productId+beg_date+end_date,
                                               data= sku_transact_merge, sum), 0)

The result is not what I desire, and I'm getting an error that says:

(list) object cannot be coerced to type 'double'

Any pointers on how to properly execute this logic would be much appreciated!

TylerDurden
BlackHat
  • Is it intentional that you have duplicate `productId,beg_date,end_date` records in `product_info` (i.e. `A,2014-05-01,2014-05-31` and `B,2014-06-01,2014-06-31`) ? That will cause records in `transact_data` to be duplicated by the `merge()` call, and thus summed multiple times during the eventual `sum()`, which I can't imagine would be correct. – bgoldst May 30 '15 at 00:05
  • Yup that was intentional. – BlackHat May 30 '15 at 00:06
  • Do you want to aggregate those duplicated `qty` values? If your desired output is a table unique by `productId,beg_date,end_date`, then those duplicated `qty` values would all be summed into the same output record, which I still can't imagine would be correct. Your desired output is not clear to me. – bgoldst May 30 '15 at 00:13
  • Or do you want one `total_qty` value per record in `product_info`? – bgoldst May 30 '15 at 00:15
  • Yup I'm looking for total quantity per record. – BlackHat May 30 '15 at 00:19
  • The result set would be used in a model - I have other parameters but total quantity per record in product_info is what I'm looking for. Thank you. – BlackHat May 30 '15 at 00:21

3 Answers

3

This could be another way to do it, using the dplyr package (it should also be efficient if your data set is huge):

library(dplyr)
df = subset(sku_transact_merge, date >= beg_date & date <= end_date)
df = subset(df, select = -c(date))
out = unique(df %>% group_by(productId, old_price) %>% mutate(qty = sum(qty)))

#> out
#Source: local data frame [6 x 6]
#Groups: productId, old_price

#productId old_price new_price   beg_date   end_date qty
#1         A      0.50      0.70 2014-05-01 2014-05-31  25
#2         A      0.11      0.12 2014-05-01 2014-05-31  25
#3         B      0.10      0.11 2014-06-01 2014-06-31  20
#4         B      0.40      0.30 2014-06-01 2014-06-31  20
#5         C      0.12      0.11 2014-06-01 2014-06-31   9
#6         C      0.30      0.20 2014-05-01 2014-05-31   2
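
A variant of the same idea (my sketch, not part of the original answer) uses summarise() instead of mutate() plus unique(), so each group collapses to a single row; grouping by all of the product_info columns keeps them in the result:

# Sketch only: summarise() returns one row per group, so no unique() step is needed.
out2 = df %>%
  group_by(productId, old_price, new_price, beg_date, end_date) %>%
  summarise(qty = sum(qty))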

Or else you could use data.table:

library(data.table)
out = setDT(df)[, list(qtynew = sum(qty)), by = list(productId, old_price)]

#> out
#   productId old_price qtynew
#1:         A      0.50     25
#2:         A      0.11     25
#3:         B      0.10     20
#4:         B      0.40     20
#5:         C      0.12      9
#6:         C      0.30      2
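
Since the grouped result keeps only productId, old_price and the sum, a left join back onto product_info (my sketch, not part of the original answer) would attach a total to every original record:

# Sketch only: rows of product_info with no in-range transactions get NA here.
# (Joining on the numeric old_price works because the values are copies of the same column.)
result = merge(product_info, as.data.frame(out), by = c("productId", "old_price"), all.x = TRUE)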
Veerendra Gadekar
  • what does the -6 mean? If it limits my result to a 6x6 matrix then it would not be scalable if I have more than 6 records. Or is it just the number of columns? – BlackHat May 30 '15 at 02:10
  • This might sound dumb but why are we using "=" to return in R. Why not "<-" ? Thanks. – BlackHat May 30 '15 at 02:17
  • In R, both `=` and `<-` can be used as assignment operators. I prefer `=` because it's just easier to type! If you are interested in knowing more, check the discussion [here](http://stackoverflow.com/questions/1741820/assignment-operators-in-r-and) – Veerendra Gadekar May 30 '15 at 02:23
  • Yea it seems to have worked. I'm doing some QA right now. Thank you. – BlackHat May 30 '15 at 02:27
  • Haha yes of course. I'm testing out the other solutions and will put the tick on the preferred. Thank you. – BlackHat May 30 '15 at 02:30
  • So in this case If I wanted to repeat the same process (I know it sounds stupid) but append a new quantity to what we have from df ... what would be the fastest way to do it without having to drop columns and merge again? – BlackHat May 30 '15 at 02:41
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/79174/discussion-between-user3116753-and-veerendra-gadekar). – BlackHat May 30 '15 at 03:08
2

One approach would be to loop through the rows of product_info, finding all matching transactions in transact_data and summing their quantities:

sapply(seq(nrow(product_info)), function(x) {
  d <- product_info[x,]
  sum(transact_data$qty[transact_data$productId == d$productId &
                        transact_data$date >= d$beg_date &
                        transact_data$date <= d$end_date])
})
# [1] 25 20 25  9  2 20

You could add this as a new column in product_info if desired.
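
For instance (the column name total_qty is my choice, not from the original answer):

# Assign the per-record sums computed above as a new column on product_info.
product_info$total_qty <- sapply(seq(nrow(product_info)), function(x) {
  d <- product_info[x, ]
  sum(transact_data$qty[transact_data$productId == d$productId &
                        transact_data$date >= d$beg_date &
                        transact_data$date <= d$end_date])
})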

josliber
  • Thanks. I tried your solution and tried to append the values to a new column in product_info. I get an "error level sets of factors are different" – BlackHat May 30 '15 at 01:48
  • @user3116753 try converting the variables to strings with the `as.character` function. – josliber May 30 '15 at 02:45
1
product_info$total_qty <- aggregate(
    col~row,
    which(
        outer(product_info$productId,transact_data$productId,`==`) &
        outer(product_info$beg_date,transact_data$date,`<=`) &
        outer(product_info$end_date,transact_data$date,`>=`),
        arr.ind=T
    ),
    function(x) sum(transact_data$qty[x])
)$col;
product_info;
##   productId old_price new_price   beg_date   end_date total_qty
## 1         A      0.50      0.70 2014-05-01 2014-05-31        25
## 2         B      0.10      0.11 2014-06-01 2014-06-31        20
## 3         A      0.11      0.12 2014-05-01 2014-05-31        25
## 4         C      0.12      0.11 2014-06-01 2014-06-31         9
## 5         C      0.30      0.20 2014-05-01 2014-05-31         2
## 6         B      0.40      0.30 2014-06-01 2014-06-31        20

Explanation

First, a logical matrix is constructed for each of the three match criteria, using outer() to compare every record in product_info with every record in transact_data. These three logical matrices are logical-ANDed together to form a final logical matrix representing which combinations of records match.

outer(product_info$productId,transact_data$productId,`==`)
&outer(product_info$beg_date,transact_data$date,`<=`)
&outer(product_info$end_date,transact_data$date,`>=`)
##       [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]  [,9] [,10] [,11] [,12]
## [1,]  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [2,] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
## [3,]  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [4,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
## [5,] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
## [6,] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE

Then, the row and column indexes with TRUE are ascertained via a call to which() with arr.ind=T. Row indexes represent the matching records from product_info (since it was on the left of the outer() calls), and column indexes represent the matching records from transact_data.

which(...,arr.ind=T)
##       row col
##  [1,]   1   1
##  [2,]   3   1
##  [3,]   2   2
##  [4,]   6   2
##  [5,]   1   5
##  [6,]   3   5
##  [7,]   5   7
##  [8,]   2   8
##  [9,]   6   8
## [10,]   4  10

Since we want to sum qty values from transact_data for each record in product_info, we can aggregate() the col indexes, grouping by row, using a custom aggregation function that indexes transact_data$qty with the col indexes and sum()s them, returning a single value for each row.

aggregate(col~row,...,function(x) sum(transact_data$qty[x]))
##   row col
## 1   1  25
## 2   2  20
## 3   3  25
## 4   4   9
## 5   5   2
## 6   6  20

Finally, we can assign the result directly to product_info$total_qty to complete the solution.

product_info$total_qty <- ...$col;

I'm not entirely sure whether aggregate() is guaranteed to return its result ordered by the grouping column(s); I just asked about this at Does aggregate() guarantee that the result will be ordered by the grouping columns?

Also, I just realized that the direct assignment will fail if not all records in product_info have at least one matching record in transact_data.

If either of those assumptions is violated, the solution can be fixed as follows:

product_info$total_qty <- with(
    aggregate(
        col~row,
        which(
            outer(product_info$productId,transact_data$productId,`==`) &
            outer(product_info$beg_date,transact_data$date,`<=`) &
            outer(product_info$end_date,transact_data$date,`>=`),
            arr.ind=T
        ),
        function(x) sum(transact_data$qty[x])
    ),
    col[match(1:nrow(product_info),row)]
);
product_info;
##   productId old_price new_price   beg_date   end_date total_qty
## 1         A      0.50      0.70 2014-05-01 2014-05-31        25
## 2         B      0.10      0.11 2014-06-01 2014-06-31        20
## 3         A      0.11      0.12 2014-05-01 2014-05-31        25
## 4         C      0.12      0.11 2014-06-01 2014-06-31         9
## 5         C      0.30      0.20 2014-05-01 2014-05-31         2
## 6         B      0.40      0.30 2014-06-01 2014-06-31        20

Now, instead of the final step of dereferencing $col, we must construct a complete vector of length equal to the number of rows in product_info, and match() the qty sums (which are inside col) to their corresponding indexes (inside row), with a little help from with().

product_info$total_qty <- with(...,col[match(1:nrow(product_info),row)]);
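
As a small follow-up (my addition, assuming a zero total is wanted for unmatched records): rows of product_info with no matching transaction come back from match() as NA, which can be zeroed out afterwards.

# Sketch only: convert NA totals (no in-range transactions) to 0 if that is the desired default.
product_info$total_qty[is.na(product_info$total_qty)] <- 0;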
bgoldst
  • Current machine is running out of memory using your method so will test on another and get back to you. – BlackHat May 30 '15 at 01:57
  • Nope. I like the solution but my machine is actually running out of memory when I try to run this. It's not your solution. This machine needs to go. Thanks for your help. – BlackHat May 30 '15 at 05:31