
We have a data.frame d2_cleaned:

> dput(d2_cleaned[1:3,c(2,4,5)])
structure(list(Customer.ID = c(110531L, 110531L, 110531L), Time.Spent.Watching = c(16032, 
10919, 236), Video.ID.v26 = c(3661L, 4313L, 3661L)), .Names = c("Customer.ID", 
"Time.Spent.Watching", "Video.ID.v26"), row.names = c(333515L, 
333516L, 333522L), class = "data.frame")
>

We have another data.frame, distinct_customers_after_cleaning. Its first column is the unique Customer.ID, so each user is represented once. The remaining columns (ignore column 2, Hits.After.Cleaning) are the unique movie IDs. The df looks like this:

> dput(distinct_customers_after_cleaning[1:5,1:5])
structure(list(Customer.ID = c(110531L, 318721L, 468491L, 568071L, 
1390371L), Hits.After.Cleaning = c(58L, 44L, 98L, 6L, 5L), `3661` = c(0, 
0, 0, 0, 0), `4313` = c(0, 0, 0, 0, 0), `3661.1` = c(0, 0, 0, 
0, 0)), .Names = c("Customer.ID", "Hits.After.Cleaning", "3661", 
"4313", "3661.1"), row.names = c(NA, 5L), class = "data.frame")
>

What I want is to fill in the values of distinct_customers_after_cleaning. To do this, I want to take each row's Time.Spent.Watching from d2_cleaned and add it to the right cell of distinct_customers_after_cleaning. To find the right cell I need to match both the user and the movie ID, i.e. d2_cleaned[i, 'Customer.ID'] == distinct_customers_after_cleaning[j, 'Customer.ID'] and d2_cleaned[i, 'Video.ID.v26'] == names(distinct_customers_after_cleaning[y]). Here are the for loops I used:

# Fill in the rows: for each customer row j and each viewing record i,
# add the watch time into the column whose name matches the video ID.
for (j in 1:10000) {                     # first 10,000 customers
  print(j)                               # progress indicator
  for (i in 1:nrow(d2_cleaned)) {        # every viewing record
    if (d2_cleaned[i, 'Customer.ID'] == distinct_customers_after_cleaning[j, 'Customer.ID']) {
      for (y in 1:ncol(distinct_customers_after_cleaning)) {  # every video column
        if (d2_cleaned[i, 'Video.ID.v26'] == names(distinct_customers_after_cleaning[y])) {
          distinct_customers_after_cleaning[j, y] <-
            distinct_customers_after_cleaning[j, y] + d2_cleaned[i, 'Time.Spent.Watching']
        }
      }
    }
  }
}

Although this code does what I want, it is very slow (it needs about 4 days to go through all the data). Could you please recommend a faster solution, possibly using aggregate()?

Alex S
    too much information. please revise your question and only add input data, expected output data and what you have tried. – Roman May 15 '18 at 11:16
  • Ok I will try to do that. – Alex S May 15 '18 at 11:21
  • I tried to delete unnecessary info, I hope what is left is what is needed.. – Alex S May 15 '18 at 11:28
  • Ok jogo, thanks for letting me know, I am new to stackoverflow and I try to make valid posts. Although issue is solved, I am putting my comment up in the description. – Alex S May 15 '18 at 12:06

1 Answer


You can use dplyr and tidyr to achieve your goal. Here is an example:

library(dplyr)
library(tidyr)

# Toy data: one row per viewing event (customer, video, time watched)
df <- data.frame(CUSTOMERS = c('u1', 'u1', 'u2', 'u1', 'u2', 'u3'),
                 VIDEOS = c('v1', 'v1', 'v1', 'v2', 'v3', 'v2'),
                 TIME = c(7, 12, 9, 2, 6, 4))

# Sum the time per customer/video pair
df_summary <- df %>% group_by(CUSTOMERS, VIDEOS) %>% summarise(TIME_SUM = sum(TIME))

# Reshape from long to wide: one row per customer, one column per video
df_summary_spread <- df_summary %>% spread(key = 'VIDEOS', value = 'TIME_SUM')

df:

  CUSTOMERS VIDEOS TIME
1        u1     v1    7
2        u1     v1   12
3        u2     v1    9
4        u1     v2    2
5        u2     v3    6
6        u3     v2    4

df_summary:

  CUSTOMERS VIDEOS TIME_SUM
  <fct>     <fct>     <dbl>
1 u1        v1          19.
2 u1        v2           2.
3 u2        v1           9.
4 u2        v3           6.
5 u3        v2           4.

df_summary_spread:

  CUSTOMERS    v1    v2    v3
  <fct>     <dbl> <dbl> <dbl>
1 u1          19.    2.   NA 
2 u2           9.   NA     6.
3 u3          NA     4.   NA 
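
Applied to the data in the question, the same idea would look roughly like the sketch below. This is only a sketch: it assumes d2_cleaned has the Customer.ID, Video.ID.v26 and Time.Spent.Watching columns shown in the dput output, and it uses spread()'s fill argument so that customer/video pairs that never occur get 0 instead of NA (the name watch_matrix is just illustrative):

library(dplyr)
library(tidyr)

# Sum watch time per customer/video pair, then spread the video IDs into
# columns; fill = 0 writes a zero where a customer never watched a video.
watch_matrix <- d2_cleaned %>%
  group_by(Customer.ID, Video.ID.v26) %>%
  summarise(Total.Time = sum(Time.Spent.Watching)) %>%
  spread(key = Video.ID.v26, value = Total.Time, fill = 0)

This would replace both the pre-built distinct_customers_after_cleaning frame and the triple loop; the Hits.After.Cleaning column would have to be joined back on afterwards if it is still needed.
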
Ika8
  • If you use this way could you please share the time it takes? I'm curious ^^ – Ika8 May 15 '18 at 12:08
  • Thank you so much, without having tested it yet, your example is exactly what I was trying to explain. Not only have you most likely solved my issue, but you showed me the right way that I should have asked this question. Trying to implement it right now, will let you know the time it takes as soon as it is done. Thank you so much :) – Alex S May 15 '18 at 12:20
  • I'm not gonna lie, your question was a little bit messy. I'm glad that helps you :) – Ika8 May 15 '18 at 12:23
  • It is crazy! It works SUPER fast. `df_summary <- ` runs instantly and `df_summary_spread <- ` runs in like 6-8 seconds.. `> dim(d2_cleaned) [1] 465642 15 > dim(df_summary_spread) [1] 10197 8280` – Alex S May 15 '18 at 12:47
  • @AlexS `aggregate()` can not reshape your data from *long* to *wide* https://stackoverflow.com/questions/5890584/how-to-reshape-data-from-long-to-wide-format or https://stackoverflow.com/questions/9617348/reshape-three-column-data-frame-to-matrix-long-to-wide-format and for the aggregation: https://stackoverflow.com/questions/3505701/grouping-functions-tapply-by-aggregate-and-the-apply-family – jogo May 15 '18 at 14:18
  • Thank you so much for taking the time to solve my issue and point out guides that will help me understand R. – Alex S May 15 '18 at 16:34
  • I'm glad that example helps you ^^. Check dplyr docs if you wanna understand what %>% operator does... https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html – Ika8 May 16 '18 at 14:12
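
For the aggregate() route mentioned in the comments above: aggregate() only does the per-pair summing, so a separate long-to-wide step is still needed. A minimal base-R sketch of that route, using the toy df from the answer (xtabs() is just one of several ways to do the reshape):

# Base-R alternative to the dplyr pipeline above
agg  <- aggregate(TIME ~ CUSTOMERS + VIDEOS, data = df, FUN = sum)  # per-pair sums
wide <- xtabs(TIME ~ CUSTOMERS + VIDEOS, data = agg)                # long -> wide, zeros for missing pairs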