2

I have the following df, which was obtained from a excel file:

df1 <- data.frame( Colour = c("Green","Red","Blue"), 
                   Code = c("N","U", "U"), 
                   User1 = c("John","Brad","Peter"), 
                   User2 = c("Meg","Meg","John"), 
                   User3= c("", "Lucy", ""))

I need to rearrange it in order to obtain a data frame where all names are listed in the first column (only once) and the colors (and respective codes) appear in the following columns, as shown:

df2 <- data.frame(User=c("John","Brad","Peter","Meg","Lucy"),
                  Color1 = c("Green","Red","Blue","Green","Red"),
                  Code1 = c("N","U","U","N","U"), 
                  Color2=c("Blue","","","Red",""),
                  Code2=c("U","","","U",""))

I´d appreciate some help. Many thanks,

A5C1D2H2I1M1N2O1R2T1
  • 190,393
  • 28
  • 405
  • 485
Gil33
  • 123
  • 1
  • 9

3 Answers3

4

I hesitate to post this because of the conceptual similarity with @akrun's answer, but you can do this also with merged.stack from my "splitstackshape" package along with reshape from base R.

library(splitstackshape)
reshape(
  getanID(
    merged.stack(df1, var.stubs = "User", sep = "var.stubs")[User != ""], 
    "User"), 
  direction = "wide", idvar = "User", timevar = ".id", drop = ".time_1")
#     User Colour.1 Code.1 Colour.2 Code.2
# 1: Peter     Blue      U       NA     NA
# 2:  John     Blue      U    Green      N
# 3:   Meg    Green      N      Red      U
# 4:  Brad      Red      U       NA     NA
# 5:  Lucy      Red      U       NA     NA

merged.stack makes the data long, getanID creates an ID variable to be used when going to a wide form, reshape does the actual transformation from this semi-wide form to a wide form.


This was the best I could think of for the "dplyr" + "tidyr" users. Seems quite verbose, but shouldn't be too hard to follow:

library(dplyr)
library(tidyr)

df1 %>%
  gather(var, User, User1:User3) %>%      # Get the data into a long form
  filter(User != "") %>%                  # Drop empty rows
  group_by(User) %>%                      # Group by User
  mutate(Id = sequence(n())) %>%          # Create a new id variable
  gather(var2, value, Colour, Code) %>%   # Go long a second time
  unite(Key, var2, Id) %>%                # Combine values to create a key
  spread(Key, value, fill = "")           # Convert back to a wide form
# Source: local data frame [6 x 6]
# 
#     var  User Code_1 Code_2 Colour_1 Colour_2
# 1 User1  Brad      U             Red         
# 2 User1  John      N           Green         
# 3 User1 Peter      U            Blue         
# 4 User2  John             U              Blue
# 5 User2   Meg      N      U    Green      Red
# 6 User3  Lucy      U             Red         
A5C1D2H2I1M1N2O1R2T1
  • 190,393
  • 28
  • 405
  • 485
4

It's not pretty, but here's another solution in pure base R that uses a couple of calls to reshape():

reshape(transform(subset(reshape(df1,varying=grep('^User',names(df1)),dir='l',v.names='User'),User!=''),id=NULL,time=ave(c(User),User,FUN=seq_along),User=factor(User)),dir='w',idvar='User',sep='');
##      User Colour1 Code1 Colour2 Code2
## 1.1  John   Green     N    Blue     U
## 2.1  Brad     Red     U    <NA>  <NA>
## 3.1 Peter    Blue     U    <NA>  <NA>
## 1.2   Meg   Green     N     Red     U
## 2.3  Lucy     Red     U    <NA>  <NA>
bgoldst
  • 34,190
  • 6
  • 38
  • 64
3

We can use dcast from the devel version of data.table, i.e. v1.9.5+. It can take multiple value.var columns. We convert the data.frame to data.table (setDT(df1)), melt the data with id columns as 'Colour' and 'Code', remove the rows where 'User' is not equal to '' ([User!='']), create a grouping sequence based on the 'User' column and dcast. The instructions to install are here

library(data.table)#v1.9.5+
dcast(melt(setDT(df1), id.var=c('Colour', 'Code'), 
           value.name='User')[User!=''][,
              N:=1:.N, User], User~N, value.var=c('Colour', 'Code'))
#    User 1_Colour 2_Colour 1_Code 2_Code
#1:  Brad      Red       NA      U     NA
#2:  John    Green     Blue      N      U
#3:  Lucy      Red       NA      U     NA
#4:   Meg    Green      Red      N      U
#5: Peter     Blue       NA      U     NA

Or as @Arun mentioned in the comments, we can make use of the subset argument in dcast instead of [User!='']

dcast(melt(setDT(df1), id.var=c('Colour', 'Code'), 
             value.name='User')[,N:= 1:.N, User],
       subset=.(User !=''), User~N, value.var=c('Colour', 'Code'))
#    User 1_Colour 2_Colour 1_Code 2_Code
#1:  Brad      Red       NA      U     NA
#2:  John    Green     Blue      N      U
#3:  Lucy      Red       NA      U     NA
#4:   Meg    Green      Red      N      U
#5: Peter     Blue       NA      U     NA
akrun
  • 874,273
  • 37
  • 540
  • 662