I'm analyzing data from an eCommerce site and everything is stored in a relational format.
I want to calculate the probability that a user buys a given product, defined as the number of times the user ordered the product divided by the user's total number of orders.
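The final result should look like this:
User | Product | Probability
1 | 2323 | 0.32
As an R data frame (the values are placeholders, not computed from the example data below):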
userid <- c(1, 1, 1, 1, 2, 2, 2, 2)
product <- c(876, 324, 122, 65, 44, 324, 54, 23)
probability <- c(0.32, 0.10, 0.25, 0.5, 0.7, 0.8, 0.45, 0.05)
exampleresult <- data.frame(userid, product, probability)
Example data:
orderid <- c(100,111,122,134,144,152,164,177,188,199,200,251,222)
userid <- c(1,1,1,2,2,2,2,3,3,4,5,5,6)
orders <- data.frame(orderid, userid)
productid <- c(66,55,44,54,32,23,65,122,656,324,876,342)
productname <- c('soda','corn','apple','milk','juice','water','potato','banana','orange','fish','meat','salami')
products <- data.frame(productid, productname)
orderid <- c(100,100,100,100,100,111,111,111,122,134,134,134,134,144,144,144,144,144,144,152,164,177,188,188,188,188,199,200,251,222)
productid <- c(55,54,324,23,324,54,876,324,122,65,65,44,324,54,23,44,324,23,66,876,65,55,32,122,66,66,44,54,66,65)
ordpro <- data.frame(orderid, productid)
Every time a user buys something an order is created with all the products he or she bought. One user can have multiple orders and each order can have multiple products.
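To make these relationships concrete, the three tables can be joined into one flat table with one row per order line. This is only a minimal sketch in base R; it re-creates the example data so it runs on its own, and the names `flat`, `orders_per_user` and `lines_per_order` are introduced here purely for illustration.

```r
# Example data, re-created so the snippet is self-contained
orders <- data.frame(
  orderid = c(100,111,122,134,144,152,164,177,188,199,200,251,222),
  userid  = c(1,1,1,2,2,2,2,3,3,4,5,5,6)
)
products <- data.frame(
  productid   = c(66,55,44,54,32,23,65,122,656,324,876,342),
  productname = c('soda','corn','apple','milk','juice','water','potato',
                  'banana','orange','fish','meat','salami')
)
ordpro <- data.frame(
  orderid   = c(100,100,100,100,100,111,111,111,122,134,134,134,134,144,144,
                144,144,144,144,152,164,177,188,188,188,188,199,200,251,222),
  productid = c(55,54,324,23,324,54,876,324,122,65,65,44,324,54,23,
                44,324,23,66,876,65,55,32,122,66,66,44,54,66,65)
)

# One row per order line, with the user and the product name attached
flat <- merge(merge(ordpro, orders, by = "orderid"), products, by = "productid")

# One user can have several orders, and one order can have several lines
orders_per_user <- table(orders$userid)
lines_per_order <- table(ordpro$orderid)
```

Note that the same product can appear more than once within a single order (for example, product 324 in order 100), which matters when deciding how to count "times ordered".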
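My current attempt is below. It loops over every user, so it takes a long time given the number of users.
users <- unique(orders$userid)
x <- numeric(length(users))
y <- vector("list", length(users))
for (i in seq_along(users)) {
  # orders placed by the i-th user
  user_orders <- orders[orders$userid == users[i], "orderid"]
  # times each product was ordered by this user, divided by the user's number of orders
  y[[i]] <- table(ordpro[ordpro$orderid %in% user_orders, "productid"]) / length(user_orders)
  x[i] <- length(y[[i]])
}
# one row per user/product pair
mydata <- data.frame(userid = rep(users, x), product = as.numeric(unlist(lapply(y, names))), probability = unlist(lapply(y, as.vector)))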
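One loop-free way to get the same kind of table is to join and aggregate. The following is a minimal sketch in base R, not a definitive solution. It assumes that a product counts once per order even if it appears on several lines of that order; dropping the `unique()` call would count every line instead. The names `order_products`, `counts`, `totals` and `result` are hypothetical.

```r
# Example data, re-created so the snippet is self-contained
orders <- data.frame(
  orderid = c(100,111,122,134,144,152,164,177,188,199,200,251,222),
  userid  = c(1,1,1,2,2,2,2,3,3,4,5,5,6)
)
ordpro <- data.frame(
  orderid   = c(100,100,100,100,100,111,111,111,122,134,134,134,134,144,144,
                144,144,144,144,152,164,177,188,188,188,188,199,200,251,222),
  productid = c(55,54,324,23,324,54,876,324,122,65,65,44,324,54,23,
                44,324,23,66,876,65,55,32,122,66,66,44,54,66,65)
)

# Assumption: count a product at most once per order
order_products <- unique(ordpro)

# Attach the user to every (order, product) pair
order_products <- merge(order_products, orders, by = "orderid")

# Numerator: number of each user's orders that contain each product
counts <- aggregate(orderid ~ userid + productid, data = order_products, FUN = length)
names(counts)[names(counts) == "orderid"] <- "n_with_product"

# Denominator: number of orders per user
totals <- aggregate(orderid ~ userid, data = orders, FUN = length)
names(totals)[names(totals) == "orderid"] <- "n_orders"

# Combine and compute the per-user, per-product share
result <- merge(counts, totals, by = "userid")
result$probability <- result$n_with_product / result$n_orders
result <- result[order(result$userid, result$productid),
                 c("userid", "productid", "probability")]
```

For example, user 1 has three orders and product 324 appears in two of them, so that row gets a probability of 2/3 under this counting rule. On large data the same join-then-aggregate idea could also be expressed with a package such as data.table or dplyr.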