0

I have two data frames as follows

The first is a survey table which tells when a person's survey was taken

ID = c('1000021','1000021')
SurveyDate = c('2014-05-30','2013-05-01')
dfsurvey = data.frame(ID,SurveyDate)
> dfsurvey
              ID  SurveyDate
1        1000021  2014-05-30
2        1000021  2013-05-01

The second is a hobbies table which tells the person's hobbies recorded on that day. On different days, his hobbies could be different.

ID = c('1000021','1000021','1000021','1000021','1000021','1000021','1000021')
HobbyName = c('Running','Volleyball','Pingpong','Badminton','Swimming','Running','Pingpong')
SurveyDate = c('2014-05-30','2014-05-30','2014-05-30','2014-05-30','2014-05-30','2013-05-01','2013-05-01')
dfhobby = data.frame(ID,HobbyName,SurveyDate)
> dfhobby
   ID                                      HobbyName  SurveyDate
1        1000021                             Running  2014-05-30
2        1000021                          Volleyball  2014-05-30
3        1000021                            Pingpong  2014-05-30
4        1000021                           Badminton  2014-05-30
5        1000021                            Swimming  2014-05-30
6        1000021                             Running  2013-05-01
7        1000021                            Pingpong  2013-05-01

To the survey table which has only two rows, I would like to add the expanded list of hobbies, each hobby getting it's own column, what I would call "flattening". Something like this,

#expected final output - add columns to dfsurvey
> dfsurvey
ID     SurveyDate                        Hobby_Running     Hobby_Volleyball     Hobby_Pingpong    Hobby_Badminton Hobby_Swimming
1        1000021                                 1                    1                  1                  1              1 
2        1000021                                 1                    0                  1                  0              0

This is my code I basically first construct the column names, and then use a nested for loop to mark 1 against the hobby. However, this is very very slow, around one second for one iteration of the nested for loop

#making columns and setting them to 0 as default
hobbyvalues = unique(dfhobby$HobbyName)
for(i in 1:length(hobbyvalues))
{
    print(i)
    dfsurvey[paste("Hobby_",hobbyvalues[i],sep="")] = 0
}

#flattening iterative
for(i in 1:nrow(dfsurvey))
{
    print(i)

    listofhobbies = dfhobby[which(dfhobby$ID == dfsurvey[i,"ID"] & dfhobby$SurveyDate == dfsurvey[i,"SurveyDate"]),"HobbyName"]

    if(length(listofhobbies) > 0)
    {
        for(l in 1:length(listofhobbies))
        {
            dfsurvey[i,paste("Hobby_",listofhobbies[l],sep="")] = 1
        }
    }
}

I have also tried the foreach package and doMC package and was able to write code in parallel. However, this is slow as well.

Is there a better way or library in R which can help me do this? Thanks.

A5C1D2H2I1M1N2O1R2T1
  • 190,393
  • 28
  • 405
  • 485
tubby
  • 2,074
  • 3
  • 33
  • 55
  • 1
    Based on your expected result, something like `reshape2::dcast(dfhobby, SurveyDate ~ HobbyName, length)` might do. – Khashaa Apr 25 '15 at 07:49

2 Answers2

3
> library(reshape2)
> dcast(dfhobby,ID*SurveyDate~HobbyName,fill=0,length)

       ID SurveyDate Badminton Pingpong Running Swimming Volleyball
1 1000021 2013-05-01         0        1       1        0          0
2 1000021 2014-05-30         1        1       1        1          1


> dcast(dfhobby,SurveyDate~HobbyName,fill=0,length)

  SurveyDate Badminton Pingpong Running Swimming Volleyball
1 2013-05-01         0        1       1        0          0
2 2014-05-30         1        1       1        1          1
Prasanna Nandakumar
  • 4,295
  • 34
  • 63
  • thanks. It's working for sample data. But with my data set, I get an error as follows Using SurveyDate as value column: use value.var to override. *** caught segfault *** address 0x7fff413441c0, cause 'memory not mapped' Traceback: 1: .Call("split_indices", group, as.integer(n)) 2: split_indices(.group, .n) 3: vaggregate(.value = value, .group = overall, .fun = fun.aggregate, ..., .default = fill, .n = n) 4: cast(data, formula, fun.aggregate, ..., subset = subset, fill = fill, drop = drop, value.var = value.var) 5: dcast(dfhobby,SurveyDate~HobbyName,fill=0,length) – tubby Apr 25 '15 at 15:12
1

R packages dplyr and tidyr are here just to do this. They are very fast to work on large datasets also. For more info on this see data manipulation cheatsheet in Rstudio page

library(dplyr)
library(tidyr)
df %>% group_by(ID,SurveyDate,HobbyName) %>% 
    mutate(Count = n()) %>% spread(HobbyName ,Count,fill=0)
Koundy
  • 5,265
  • 3
  • 24
  • 37
  • I get this error below Error in UseMethod("group_by_") : no applicable method for 'group_by_' applied to an object of class "function" – tubby Apr 25 '15 at 16:01