2

First post, long-time user.

I'm trying to efficiently sum a column based on 2 criteria for every ID in another data frame of a different length. Below is an example:

   ID
1  A 
2  B
3  C

ID   Color   Type  Price
A  Green   1     5
A  Blue    2     6
B  Green   3     7
B  Blue    2     2
C  Green   2     4
C  Blue    4     5

For each ID, I'd like to sum the price if the color is blue and the type is 2. The result would hopefully look like this:

   ID  Price
1  A   6
2  B   2
3  C   0

This seems like an easy task but I can't figure it out for some reason. Also, I'll need to perform this operation on 2 large data sets (>1,000,000 rows each). I've created a function and used it in a loop for prior problems like this, but that solution doesn't work here because of the amount of data. I feel that a function from the apply family would probably be best, but I can't get them to work.

Ralf Stubner
BFretti
  • Hint: create a new column defined as `df$Price*(df$Type==2 & df$Color=="Blue")` (which is `Price` when your conditions are met and 0 otherwise), then `aggregate` that column by the `ID` column with `sum` (or use any `dplyr` or `data.table` equivalent). – nicola Apr 30 '18 at 10:22
  • With `dplyr`: `dt %>% group_by(ID) %>% summarize(totalPrice = sum(Price[Type==2 & Color=="Blue"]))`. –  Apr 30 '18 at 10:24
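
For reference, a minimal base-R sketch of the aggregate approach from the first comment, assuming (as in the answers below) that the ID table is called df1 and the price table df2; the final merge is an extra step, not part of the comment, to keep IDs from df1 that have no rows at all in df2:

df2$masked <- df2$Price * (df2$Type == 2 & df2$Color == "Blue")  # Price where both criteria hold, else 0
res <- aggregate(masked ~ ID, data = df2, FUN = sum)             # sum the masked column per ID
merge(df1, res, by = "ID", all.x = TRUE)                         # all.x = TRUE keeps every ID from df1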

2 Answers

2

I changed your example data a bit so that it takes into account the fact that not all IDs are in the first data frame, and that somewhere there are two values to sum:

df1 <- data.frame(ID = c("A","B","C"))

df2 <- read.table(text = "
                  ID   Color   Type  Price
                  A  Green   1     5
                  A  Blue    2     6
                  A  Blue    2     4
                  B  Green   3     7
                  B  Blue    2     2
                  C  Green   2     4
                  C  Blue    4     5
                  D  Green   2     2
                  D  Blue    4     8
                  ",header = T)

The two main packages to do that fast and on big data frames are dplyr and data.table. They are quite equivalent (almost; see data.table vs dplyr: can one do something well the other can't or does poorly?). Here are the two solutions:

library(data.table)

setDT(df2)[ID %in% unique(df1$ID), .(sum = sum(Price[ Type == 2 & Color == "Blue"])),by = ID]

   ID sum
1:  A  10
2:  B   2
3:  C   0

You could do

setDT(df2)[ID %in% unique(df1$ID) & Type == 2 & Color == "Blue", .(sum = sum(Price)),by = ID]

but this will discard C, because no row for C meets the entire row-selection condition:

   ID sum
1:  A  10
2:  B   2
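
If you prefer filtering in i (the faster form just above), a hedged way to get C back is to join the aggregated result onto df1 afterwards and fill the gaps with 0; this join step is an addition of mine, not part of the original answer:

res <- setDT(df2)[Type == 2 & Color == "Blue", .(sum = sum(Price)), by = ID]
res <- res[setDT(df1), on = "ID"]  # right join onto df1: every ID of df1 comes back
res[is.na(sum), sum := 0L]         # IDs with no matching rows get 0 instead of NA
res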

and with dplyr:

library(dplyr)

df2 %>%
  filter(ID %in% unique(df1$ID)) %>%
  group_by(ID) %>%
  summarize(sum = sum(Price[Type==2 & Color=="Blue"]))

# A tibble: 3 x 2
  ID      sum
  <fct> <int>
1 A        10
2 B         2
3 C         0
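
If df1 could also contain IDs that never appear in df2 at all, a hedged dplyr variant is to filter and summarize first, then join back onto df1 and fill the gaps with 0 (the right_join/coalesce steps are my addition, not part of the answer above):

df2 %>%
  filter(Type == 2, Color == "Blue") %>%
  group_by(ID) %>%
  summarize(sum = sum(Price)) %>%
  right_join(df1, by = "ID") %>%   # keep every ID of df1, even those with no rows in df2
  mutate(sum = coalesce(sum, 0L))  # replace the resulting NA with 0
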
denis
0

An sapply version. There may be more elegant ways to write it, but if you have big tables as you said, you can easily parallelize it.

Using the data as proposed by @denis:

df1 <- data.frame(ID = c("A","B","C"))

df2 <- read.table(text = "
                  ID   Color   Type  Price
                  A  Green   1     5
                  A  Blue    2     6
                  A  Blue    2     4
                  B  Green   3     7
                  B  Blue    2     2
                  C  Green   2     4
                  C  Blue    4     5
                  D  Green   2     2
                  D  Blue    4     8
                  ",header = T)

Here is a simple function that does what you want with sapply:

getPrices <- function(tableid = df1, tablevalues = df2, color = "Blue", type = 2) {
    # keep only rows matching the requested color/type pair and the IDs of interest
    filteredtablevalues <- droplevels(tablevalues[tablevalues$Color == color &
                                                  tablevalues$Type == type &
                                                  tablevalues$ID %in% tableid$ID, ])
    # droplevels could be skipped by using unique(as.character(filteredtablevalues$ID))
    # in the sapply; not sure which would be the quickest
    sapply(levels(filteredtablevalues$ID), function(id, tabval) {
        sum(tabval$Price[tabval$ID == id])
    }, tabval = filteredtablevalues)
}

As you can see, I added two parameters that let you select the color/type pair. And if you absolutely need a data frame with an ID column and a Price column, you can add this:

tmp <- getPrices(df1, df2)
finaltable <- cbind.data.frame(ID = names(tmp), Price = tmp)


I will try some benchmarks when I have time, but written this way you should be able to parallelize this easily with library(parallel) and library(Rmpi), which can save your life if you have very, very big datasets.
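
A hedged sketch of what that could look like with library(parallel); parallelPrices and the cluster size are illustrative choices of mine, not part of the original code:

library(parallel)

parallelPrices <- function(tableid = df1, tablevalues = df2,
                           color = "Blue", type = 2, ncores = 2) {
  # same filtering step as getPrices
  filtered <- tablevalues[tablevalues$Color == color &
                          tablevalues$Type == type &
                          tablevalues$ID %in% tableid$ID, ]
  cl <- makeCluster(ncores)
  on.exit(stopCluster(cl))
  # distribute the per-ID sums over the cluster workers
  parSapply(cl, as.character(unique(tableid$ID)), function(id, tabval) {
    sum(tabval$Price[tabval$ID == id])
  }, tabval = filtered)
}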

EDIT :

Benchmark:

I was not able to reproduce the dplyr example proposed by @denis at first, but I could compare against the data.table version:

#Create a bigger dataset
nt=10000 #nt as big as you want
df2=rbind.data.frame(df2,
                     list(ID= sample(c("A","B","C"),nt,replace=T),
                          Color=sample(c("Blue","Green"),nt,replace=T),
                          Type=sample.int(5,nt,replace=T),
                          Price=sample.int(5,nt,replace=T)
                          )
                     )

You can benchmark using library(microbenchmark):

library(microbenchmark)
microbenchmark(
  sply  = getPrices(df1, df2),
  dtbl  = setDT(df2)[ID %in% unique(df1$ID),
                     .(sum = sum(Price[Type == 2 & Color == "Blue"])), by = ID],
  dplyr = df2 %>%
            filter(ID %in% unique(df1$ID)) %>%
            group_by(ID) %>%
            summarize(sum = sum(Price[Type == 2 & Color == "Blue"]))
)

On my computer it gives:

Unit: milliseconds
  expr      min       lq      mean    median        uq      max neval
  sply 78.37484 83.89856  97.75373  89.17033 118.96890 131.3226   100
  dtbl 75.67642 83.44380  93.16893  85.65810  91.98584 137.2851   100
 dplyr 90.67084 97.58653 114.24094 102.60008 136.34742 150.6235   100

Edit2:

At first, sapply appeared to be slightly quicker than the data.table approach, though not significantly. Using sapply could be really helpful if you have a huge ID table: you can then use library(parallel) and gain even more time.

Now the data.table approach seems to be the quickest. Still, the advantage of sapply is that you can parallelize it easily, though in that case, and given how I wrote the getPrices function, it will be efficient only if your ID table is huge.

Simon C.
  • after doing more benchmarks, I think that sapply (without parallelization) is always quicker for `nt>10000000`, and dplyr is always slower. – Simon C. Apr 30 '18 at 14:59