We have a data.frame d2_cleaned
> dput(d2_cleaned[1:3,c(2,4,5)])
structure(list(Customer.ID = c(110531L, 110531L, 110531L), Time.Spent.Watching = c(16032,
10919, 236), Video.ID.v26 = c(3661L, 4313L, 3661L)), .Names = c("Customer.ID",
"Time.Spent.Watching", "Video.ID.v26"), row.names = c(333515L,
333516L, 333522L), class = "data.frame")
>
We have another df=distinct_customers_after_cleaning
, the first column is the unique User.ID
so each user is represented once. The rest of the columns (ignore col2) are all the unique movies. The df looks like this:
> dput(distinct_customers_after_cleaning[1:5,1:5])
structure(list(Customer.ID = c(110531L, 318721L, 468491L, 568071L,
1390371L), Hits.After.Cleaning = c(58L, 44L, 98L, 6L, 5L), `3661` = c(0,
0, 0, 0, 0), `4313` = c(0, 0, 0, 0, 0), `3661.1` = c(0, 0, 0,
0, 0)), .Names = c("Customer.ID", "Hits.After.Cleaning", "3661",
"4313", "3661.1"), row.names = c(NA, 5L), class = "data.frame")
>
What I want is to fill in the values of df distinct_customers_after_cleaning. To do this, I want to take each row of Time.Spent.Watching of d2_cleaned and sum it in the right place of distinct_customers_after_cleaning. To find the right place I need to match both user and movie IDs: if (d2_cleaned[i, 'Customer.ID'] == distinct_customers_after_cleaning[j, 'Customer.ID']) and if (d2_cleaned[i, 'Video.ID.v26'] == names(distinct_customers_after_cleaning[y])) Here is the for loops I used:
#fill in rows
for (j in 1 : 10000) {
print(j)
for (i in 1 : nrow(d2_cleaned)) {
if (d2_cleaned[i, 'Customer.ID'] == distinct_customers_after_cleaning[j, 'Customer.ID']) {
for (y in 1:ncol(distinct_customers_after_cleaning)) {
if (d2_cleaned[i, 'Video.ID.v26'] == names(distinct_customers_after_cleaning[y])) {
distinct_customers_after_cleaning[j, y] <- distinct_customers_after_cleaning[j, y] + d2_cleaned[i,'Time.Spent.Watching']
}
}
}
}
}
Although this code works as I want, it is very slow (needs 4 days to go through all data). Could you please recommend a better solution, possibly including aggregate
?