3

I have a question that I have not found an answer for. There are similar questions whose solutions don't quite work in my situation. I have a data set that has four columns like this example:

Name   Session   Sequence   Page
Bob     001       001       home
Bob     001       002       news
Bob     001       003       contact_us
Bob     001       004       home
Sally   001       001       home
Sally   001       002       contact_us
Bob     002       001       home
John    001       001       home
John    001       002       about_us

What I would like is something like this

Name    Session   Pages
Bob     001       home-news-contact_us-home
Sally   001       home-contact_us
Bob     002       home
John    001       home-about-us

Now the trick is that Sequence can be from 1:44, or anywhere in between. I am coding in R and have SQLite available. I also need to concatenate in the dashes, but that is easy. If R had something like 'lag' in SAS this would be a snap.

TimL
  • 49
  • 5

1 Answers1

3

You already have some excellent answers, but here is a dplyr one which hopefully lends some readability.

library(dplyr)

df %>%
    group_by(Name, Session) %>% # create summary data for each unique group
    summarise(Page = paste0(Page, collapse = "-")) 

which gives

Source: local data frame [4 x 3]
Groups: Name

   Name Session                      Page
1   Bob       1 home-news-contact_us-home
2   Bob       2                      home
3  John       1             home-about_us
4 Sally       1           home-contact_us

Rereading your question it seems like the sequence of pages is important, i.e. you would like the finally page column to have the pages visited in sequence left to right. Therefore, we include an extra step.

library(dplyr)

df %>%
    group_by(Name, Session) %>% # create summary data for each unique group
    arrange(Sequence) %>% # makes sure that Sequence for each group is in ascending order.
    summarise(Page = paste0(Page, collapse = "-")) 
Alex
  • 15,186
  • 15
  • 73
  • 127
  • I am pulling the initial data with SQL and have an order by on the Sequence so I don't need to do that again here. But this is very cool too. – TimL Mar 11 '15 at 23:56
  • You can also use `dplyr` to pull the data directly from SQL as well. – Alex Mar 12 '15 at 00:03
  • Could I use `dplyr` to remove duplicates? I have a case where I have. `Name Session Sequence Page Bill 001 001 home-home-home-home-contact_us-home` I would like to see that as: `Name Session Sequence Page Bill 001 001 home-contact_us-home` – TimL Mar 12 '15 at 18:53
  • Yes, since you wish to remove duplicates of `Sequence` within each group of `Name` and `Session`, we can simply do `df <- df %>% group_by(Name, Session, Sequence) %>% filter(row_number() == 1)`. Then use this deduplicated dataframe in the above manner :) – Alex Mar 14 '15 at 04:09