1

I have a clickstream .csv data file in this format :

     Step_Name   Session_ID
        p-1          1
        p-2          1
        defer        1
        p-1          2
        p-3          2
        buy          2
        p-4          3 
        defer        3

I am trying to use clickstream package in R for analysis. This package requires us to convert the above data into this format :

   1    p-1    p-2    defer
   2    p-1    p-3    buy 
   3    p-4    defer   

where the first column is Session ID and number of columns may vary for each session depending on the number of steps the website user has taken. How do I get the initial data into the above format (in either excel or R)?

Also, if there are other ways to analyse clickstream data in R without such data manipulation, please let me know.

user75252
  • 169
  • 2
  • 2
  • 13

1 Answers1

0

Using data.table, load your data:

> library(data.table)
> df = data.table('Step_Name' = c('p-1', 'p-2', 'defer','p-1','p-3','buy','p-4','defer'), 'Session_ID' = c(rep(1,3),rep(2,3),rep(3,2)))
> df
   Step_Name Session_ID
1:       p-1          1
2:       p-2          1
3:     defer          1
4:       p-1          2
5:       p-3          2
6:       buy          2
7:       p-4          3
8:     defer          3

Add a row number ID, restarting for each new Session_ID.

> df[, id := seq_len(.N), by = Session_ID]

Cast your data table:

> df_final = dcast(df, Session_ID ~ id, value.var = 'Step_Name')

And your desired result:

> df_final
   Session_ID   1     2     3
1:          1 p-1   p-2 defer
2:          2 p-1   p-3   buy
3:          3 p-4 defer  <NA>

NB: This relies on your data being in order within Session_ID.

cddt
  • 539
  • 5
  • 14