
The ultimate goal is to sum the total quantity (transact_data$qty) for each record in product_info, over the rows of transact_data whose productId matches that record and whose date falls between that record's beg_date and end_date.

The dataframes are below:

product_info <- data.frame(productId = c("A", "B", "A", "C","C","B"), 
                      old_price = c(0.5,0.10,0.11,0.12,0.3,0.4),
                      new_price = c(0.7,0.11,0.12,0.11,0.2,0.3),
                      beg_date = c("2014-05-01", "2014-06-01", "2014-05-01", "2014-06-01","2014-05-01", "2014-06-01"),
                      end_date = c("2014-05-31", "2014-06-31", "2014-05-31", "2014-06-31","2014-05-31", "2014-06-31"), stringsAsFactors=FALSE)

transact_data <- data.frame(productId=c('A', 'B','A', 'C','A', 'B','C', 'B','A', 'C','A', 'B'),
                  date=c("2014-05-05", "2014-06-22", "2014-07-05", "2014-08-31","2014-05-03", "2014-02-22",
                    "2014-05-21", "2014-06-19", "2014-03-09", "2014-06-22","2014-04-03", "2014-07-08"),
                    qty =c(12,15,5,21,13,17,2,5,11,9,6,4), stringsAsFactors=FALSE)

My first step was to merge both dataframes by productId:

sku_transact_merge <- merge(x = product_info, y = transact_data, by = c("productId"))

The next step was to calculate the quantity sum:

sku_transact_merge$total_qty <- ifelse(sku_transact_merge$date >= sku_transact_merge$beg_date & 
                                       sku_transact_merge$date <= sku_transact_merge$end_date, 
                                     aggregate(qty ~ productId+beg_date+end_date,
                                               data= sku_transact_merge, sum), 0)

The result is not what I desire, and I'm getting an error that says:

(list) object cannot be coerced to type 'double'

Any pointers on how to properly execute this logic would be much appreciated!

TylerDurden
BlackHat
  • Is it intentional that you have duplicate `productId,beg_date,end_date` records in `product_info` (i.e. `A,2014-05-01,2014-05-31` and `B,2014-06-01,2014-06-31`) ? That will cause records in `transact_data` to be duplicated by the `merge()` call, and thus summed multiple times during the eventual `sum()`, which I can't imagine would be correct. – bgoldst May 30 '15 at 00:05
  • Yup that was intentional. – BlackHat May 30 '15 at 00:06
  • Do you want to aggregate those duplicated `qty` values? If your desired output is a table unique by `productId,beg_date,end_date`, then those duplicated `qty` values would all be summed into the same output record, which I still can't imagine would be correct. Your desired output is not clear to me. – bgoldst May 30 '15 at 00:13
  • Or do you want one `total_qty` value per record in `product_info`? – bgoldst May 30 '15 at 00:15
  • Yup I'm looking for total quantity per record. – BlackHat May 30 '15 at 00:19
  • The result set would be used in a model - I have other parameters but total quantity per record in product_info is what I'm looking for. Thank you. – BlackHat May 30 '15 at 00:21

3 Answers

3

This could be another way to do it, using the dplyr package (it should also be efficient if your data set is huge):

library(dplyr)
df = subset(sku_transact_merge, date >= beg_date & date <= end_date)
df = subset(df, select = -c(date))
out = unique(df %>% group_by(productId, old_price) %>% mutate(qty = sum(qty)))

#> out
#Source: local data frame [6 x 6]
#Groups: productId, old_price

#productId old_price new_price   beg_date   end_date qty
#1         A      0.50      0.70 2014-05-01 2014-05-31  25
#2         A      0.11      0.12 2014-05-01 2014-05-31  25
#3         B      0.10      0.11 2014-06-01 2014-06-31  20
#4         B      0.40      0.30 2014-06-01 2014-06-31  20
#5         C      0.12      0.11 2014-06-01 2014-06-31   9
#6         C      0.30      0.20 2014-05-01 2014-05-31   2
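
A variant of the same idea (my sketch, not part of the original answer) uses summarise() instead of mutate() plus unique(), so each group collapses to a single row; grouping by all of the product_info columns keeps them in the result:

# Sketch only: summarise() returns one row per group, so no unique() step is needed.
out2 = df %>%
  group_by(productId, old_price, new_price, beg_date, end_date) %>%
  summarise(qty = sum(qty))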

Or else you could use data.table:

library(data.table)
out = setDT(df)[, list(qtynew = sum(qty)), by = list(productId, old_price)]

#> out
#   productId old_price qtynew
#1:         A      0.50     25
#2:         A      0.11     25
#3:         B      0.10     20
#4:         B      0.40     20
#5:         C      0.12      9
#6:         C      0.30      2
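
Since the grouped result keeps only productId, old_price and the sum, a left join back onto product_info (my sketch, not part of the original answer) would attach a total to every original record:

# Sketch only: rows of product_info with no in-range transactions get NA here.
# (Joining on the numeric old_price works because the values are copies of the same column.)
result = merge(product_info, as.data.frame(out), by = c("productId", "old_price"), all.x = TRUE)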
Veerendra Gadekar
  • what does the -6 mean? If it limits my result to a 6x6 matrix then it would not be scalable if I have more than 6 records. Or is it just the number of columns? – BlackHat May 30 '15 at 02:10
  • This might sound dumb but why are we using "=" to return in R. Why not "<-" ? Thanks. – BlackHat May 30 '15 at 02:17
  • In R, both `=` and `<-` can be used as assignment operators. I prefer `=` because it's just easier to type! If you are interested in knowing more, check the discussion [here](http://stackoverflow.com/questions/1741820/assignment-operators-in-r-and) – Veerendra Gadekar May 30 '15 at 02:23
  • Yea it seems to have worked. I'm doing some QA right now. Thank you. – BlackHat May 30 '15 at 02:27
  • Haha yes of course. I'm testing out the other solutions and will put the tick on the preferred. Thank you. – BlackHat May 30 '15 at 02:30
  • So in this case If I wanted to repeat the same process (I know it sounds stupid) but append a new quantity to what we have from df ... what would be the fastest way to do it without having to drop columns and merge again? – BlackHat May 30 '15 at 02:41
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/79174/discussion-between-user3116753-and-veerendra-gadekar). – BlackHat May 30 '15 at 03:08
2

One approach would be to loop through the rows of product_info, finding all matching transactions in transact_data and summing their quantities:

sapply(seq(nrow(product_info)), function(x) {
  d <- product_info[x,]
  sum(transact_data$qty[transact_data$productId == d$productId &
                        transact_data$date >= d$beg_date &
                        transact_data$date <= d$end_date])
})
# [1] 25 20 25  9  2 20

You could add this as a new column in product_info if desired.
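
For instance (the column name total_qty is my choice, not from the original answer):

# Assign the per-record sums computed above as a new column on product_info.
product_info$total_qty <- sapply(seq(nrow(product_info)), function(x) {
  d <- product_info[x, ]
  sum(transact_data$qty[transact_data$productId == d$productId &
                        transact_data$date >= d$beg_date &
                        transact_data$date <= d$end_date])
})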

josliber
  • Thanks. I tried your solution and tried to append the values to a new column in product_info. I get an "error level sets of factors are different" – BlackHat May 30 '15 at 01:48
  • @user3116753 try converting the variables to strings with the `as.character` function. – josliber May 30 '15 at 02:45
1
product_info$total_qty <- aggregate(
    col~row,
    which(
        outer(product_info$productId,transact_data$productId,`==`) &
        outer(product_info$beg_date,transact_data$date,`<=`) &
        outer(product_info$end_date,transact_data$date,`>=`),
        arr.ind=T
    ),
    function(x) sum(transact_data$qty[x])
)$col;
product_info;
##   productId old_price new_price   beg_date   end_date total_qty
## 1         A      0.50      0.70 2014-05-01 2014-05-31        25
## 2         B      0.10      0.11 2014-06-01 2014-06-31        20
## 3         A      0.11      0.12 2014-05-01 2014-05-31        25
## 4         C      0.12      0.11 2014-06-01 2014-06-31         9
## 5         C      0.30      0.20 2014-05-01 2014-05-31         2
## 6         B      0.40      0.30 2014-06-01 2014-06-31        20

Explanation

First, a logical matrix is constructed for each of the three match criteria, using outer() to compare every record in product_info with every record in transact_data. These three logical matrices are logical-ANDed together to form a final logical matrix representing which combinations of records match.

outer(product_info$productId,transact_data$productId,`==`)
&outer(product_info$beg_date,transact_data$date,`<=`)
&outer(product_info$end_date,transact_data$date,`>=`)
##       [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]  [,9] [,10] [,11] [,12]
## [1,]  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [2,] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
## [3,]  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [4,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
## [5,] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
## [6,] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE

Then, the row and column indexes with TRUE are ascertained via a call to which() with arr.ind=T. Row indexes represent the matching records from product_info (since it was on the left of the outer() calls), and column indexes represent the matching records from transact_data.

which(...,arr.ind=T)
##       row col
##  [1,]   1   1
##  [2,]   3   1
##  [3,]   2   2
##  [4,]   6   2
##  [5,]   1   5
##  [6,]   3   5
##  [7,]   5   7
##  [8,]   2   8
##  [9,]   6   8
## [10,]   4  10

Since we want to sum qty values from transact_data for each record in product_info, we can aggregate() the col indexes, grouping by row, using a custom aggregation function that indexes transact_data$qty with the col indexes and sum()s them, returning a single value for each row.

aggregate(col~row,...,function(x) sum(transact_data$qty[x]))
##   row col
## 1   1  25
## 2   2  20
## 3   3  25
## 4   4   9
## 5   5   2
## 6   6  20

Finally, we can assign the result directly to product_info$total_qty to complete the solution.

product_info$total_qty <- ...$col;

I'm not entirely sure whether aggregate() is guaranteed to return its result ordered by the grouping column(s); I just asked about this at Does aggregate() guarantee that the result will be ordered by the grouping columns?

Also, I just realized that the direct assignment will fail if not all records in product_info have at least one matching record in transact_data.

If either of those assumptions is violated, the solution can be fixed as follows:

product_info$total_qty <- with(
    aggregate(
        col~row,
        which(
            outer(product_info$productId,transact_data$productId,`==`) &
            outer(product_info$beg_date,transact_data$date,`<=`) &
            outer(product_info$end_date,transact_data$date,`>=`),
            arr.ind=T
        ),
        function(x) sum(transact_data$qty[x])
    ),
    col[match(1:nrow(product_info),row)]
);
product_info;
##   productId old_price new_price   beg_date   end_date total_qty
## 1         A      0.50      0.70 2014-05-01 2014-05-31        25
## 2         B      0.10      0.11 2014-06-01 2014-06-31        20
## 3         A      0.11      0.12 2014-05-01 2014-05-31        25
## 4         C      0.12      0.11 2014-06-01 2014-06-31         9
## 5         C      0.30      0.20 2014-05-01 2014-05-31         2
## 6         B      0.40      0.30 2014-06-01 2014-06-31        20

Now, instead of the final step of dereferencing $col, we must construct a complete vector of length equal to the number of rows in product_info, and match() the qty sums (which are inside col) to their corresponding indexes (inside row), with a little help from with().

product_info$total_qty <- with(...,col[match(1:nrow(product_info),row)]);
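
As a small follow-up (my addition, assuming a zero total is wanted for unmatched records): rows of product_info with no matching transaction come back from match() as NA, which can be zeroed out afterwards.

# Sketch only: convert NA totals (no in-range transactions) to 0 if that is the desired default.
product_info$total_qty[is.na(product_info$total_qty)] <- 0;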
bgoldst
  • Current machine is running out of memory using your method so will test on another and get back to you. – BlackHat May 30 '15 at 01:57
  • Nope. I like the solution but my machine is actually running out of memory when I try to run this. It's not your solution. This machine needs to go. Thanks for your help. – BlackHat May 30 '15 at 05:31