An sapply version. There may be more elegant ways to write it, but if you have big tables as you said, you can easily parallelize it.
Using the data as proposed by @denis:
df1 <- data.frame(ID = c("A","B","C"))
df2 <- read.table(text = "
ID Color Type Price
A Green 1 5
A Blue 2 6
A Blue 2 4
B Green 3 7
B Blue 2 2
C Green 2 4
C Blue 4 5
D Green 2 2
D Blue 4 8
",header = T)
Here is a simple function that does what you want with sapply:
getPrices <- function(tableid = df1, tablevalues = df2, color = "Blue", type = 2){
  # keep only the rows matching the requested color/type pair and the IDs of interest
  filteredtablevalues <- droplevels(tablevalues[tablevalues$Color == color & tablevalues$Type == type & tablevalues$ID %in% tableid$ID, ])
  # droplevels could be skipped by using unique(as.character(filteredtablevalues$ID)) in the sapply; not sure which would be quicker
  sapply(levels(filteredtablevalues$ID), function(id, tabval) {
    sum(tabval$Price[tabval$ID == id])
  }, tabval = filteredtablevalues)
}
As you can see, I added two parameters that let you select the color/type pair. And if you absolutely need a data frame with an ID column and a Price column, you can add this:
tmp <- getPrices(df1, df2)
finaltable <- cbind.data.frame(ID = names(tmp), Price = tmp)
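As an aside, the comment inside getPrices mentions skipping droplevels; a minimal sketch of that variant could look like this (getPrices2 is a hypothetical name, not part of the original answer):
getPrices2 <- function(tableid = df1, tablevalues = df2, color = "Blue", type = 2){
  # same filtering, but without the droplevels call
  filteredtablevalues <- tablevalues[tablevalues$Color == color & tablevalues$Type == type & tablevalues$ID %in% tableid$ID, ]
  # iterate over the observed IDs instead of the factor levels
  sapply(unique(as.character(filteredtablevalues$ID)), function(id, tabval) {
    sum(tabval$Price[tabval$ID == id])
  }, tabval = filteredtablevalues)
}
With the sample data above, both versions return a sum of 10 for A and 2 for B (C has no Blue/Type 2 rows and D is not in df1, so neither appears in the result).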
I will try some benchmarks when I have time, but written this way you should be able to parallelize it easily with library(parallel) and library(Rmpi), which can save your life if you have very, very big datasets.
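To give an idea, here is a minimal sketch of such a parallelized version using library(parallel) with a PSOCK cluster; getPricesPar and the ncores argument are hypothetical additions, not part of the original answer:
library(parallel)

getPricesPar <- function(tableid = df1, tablevalues = df2,
                         color = "Blue", type = 2, ncores = 2){
  filteredtablevalues <- droplevels(tablevalues[tablevalues$Color == color & tablevalues$Type == type & tablevalues$ID %in% tableid$ID, ])
  cl <- makeCluster(ncores)   # spawn the worker processes
  on.exit(stopCluster(cl))    # always shut them down, even on error
  # each worker sums the prices for the IDs it receives
  parSapply(cl, levels(filteredtablevalues$ID), function(id, tabval) {
    sum(tabval$Price[tabval$ID == id])
  }, tabval = filteredtablevalues)
}
Spawning the cluster has a fixed cost, so this only pays off when there are many distinct IDs to loop over.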
EDIT:
Benchmark:
I was not able to reproduce the dplyr example proposed by @denis at first, but I could compare against the data.table version:
# Create a bigger dataset
nt <- 10000 # make nt as big as you want
df2 <- rbind.data.frame(df2,
                        list(ID    = sample(c("A","B","C"), nt, replace = TRUE),
                             Color = sample(c("Blue","Green"), nt, replace = TRUE),
                             Type  = sample.int(5, nt, replace = TRUE),
                             Price = sample.int(5, nt, replace = TRUE)))
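As a quick sanity check, the original df2 has 9 rows, so with nt = 10000 the enlarged table should have 10009 rows:
nrow(df2)
# [1] 10009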
You can benchmark using library(microbenchmark):
library(microbenchmark)
library(data.table) # for setDT
library(dplyr)      # for filter/group_by/summarize

microbenchmark(
  sply  = getPrices(df1, df2),
  dtbl  = setDT(df2)[ID %in% unique(df1$ID), .(sum = sum(Price[Type == 2 & Color == "Blue"])), by = ID],
  dplyr = df2 %>% filter(ID %in% unique(df1$ID)) %>% group_by(ID) %>% summarize(sum = sum(Price[Type == 2 & Color == "Blue"]))
)
On my computer it gives:
Unit: milliseconds
expr min lq mean median uq max neval
sply 78.37484 83.89856 97.75373 89.17033 118.96890 131.3226 100
dtbl 75.67642 83.44380 93.16893 85.65810 91.98584 137.2851 100
dplyr 90.67084 97.58653 114.24094 102.60008 136.34742 150.6235 100
sapply appears to be slightly quicker than the data.table approach, though not significantly. But using sapply could be really helpful if you have a huge ID table: you can then use library(parallel) and gain even more time.
EDIT 2:
Now the data.table approach seems to be the quickest. But still, the advantage of sapply is that it can easily be parallelized. Though in that case, and given how I wrote the function getPrices, it will be efficient only if your ID table is huge.