1

I have df that represents users browsing behavior over time. Therefore the df contains a unique UserId and each row has a timestamp and represents a visit to a certain website. Each website has a unique website Id and a unique website category, say c("electronics", "clothes",....). Now I want to count per row how many unique websites per category the user has visited up to that row (including that row). I call this variable "breadth" since it represents how broad a user is browsing through the internet.

So far I only manage to produce dumb code that creates the total number of unique websites visited per category by filtering on each category and then take the length of the unique vector by the user and then do a left join. Therefore I do lose information about the development over time.

Thanks so much in advance!

total_breadth <- df %>% filter(category=="electronics")  %>% 
                         group_by(user_id) %>% 
                           mutate(breadth=length(unique(website_id)))


#Structure of the df I want to achieve:

user_id   time   website_id  category     breadth

1         1      70         "electronics" 1 
1         2      93         "clothing"    1
1         3      34         "electronics" 2
1         4      93         "clothing"    1
1         5      26         "electronics" 3
1         6      70         "electronics" 3  

#Structure of the df I produce:

user_id   time   website_id  category     breadth

1         1      70         "electronics" 3 
1         2      93         "clothing"    1
1         3      34         "electronics" 3
1         4      93         "clothing"    1
1         5      26         "electronics" 3
1         6      70         "electronics" 3

Ratan Uday Kumar
  • 5,738
  • 6
  • 35
  • 54
ReLa
  • 33
  • 3
  • Your `left_join` is not doing what you think it's doing. `total_breadth` already includes all your other variables plus `breadth`. If I understand correctly what you're doing with the `filter`, you should perhaps add `category` to the grouping vars. It's unclear to me how you lose information over time. If `df` changes, so will `breadth`. – Fons MA Apr 21 '19 at 22:49
  • The problem with my df is, that the breadth variable is not calculated the right way. The breadth variable should reflect the breadth based on unique website_ids per category in a cumulative manner and not in total. The length(unique()) combination calculates the variable over the complete vector per user per category. – ReLa Apr 21 '19 at 22:51
  • Related question: https://stackoverflow.com/questions/15698399/cumulative-count-of-unique-values-in-r The core part of the answers from over there works. Within each group of values you want to do `cummax(as.numeric(factor(x, levels=unique(x))))` – thelatemail Apr 21 '19 at 23:11

1 Answers1

0

This seems to be a case of a split, apply and combine. Create a binary matrix of 1s and 0s whose dimensions are:

  • No. of Rows = No. of rows in the original data
  • No of Columns = No. of unique website categories

Each Row represents the timestamp and each column represents the respective website category. So a cell will be equal to 1 if and only if the user has visited the website for that website category on the respective timestamp else it will be 0.

Take the cumulative sum for individual columns of this matrix and then create a final column where it takes the value only for the visited website category on the respective timestamp.

Though it doesn't seem to be an elegant solution, hope this should solve your problem temporarily.

Garima Gulati
  • 43
  • 1
  • 7