-2

My data has a long format similar to the one below:

ID Language  MotherTongue  SpokenatHome   HomeLang
1    English   English                      English
1    French                   French        
1    Polish    Polish         
2    Lebanese  Lebanese        Lebanese    Labanese
2    Arabic                    Arabbic

Here is the output I am looking for:

ID  Language1 Language2 Language 3  MotherTongue1  MotherTongue2  SpokenatHome1 HomeLan
1   English    French     Polish     English         Polish        French       English
2   Lebanese   Arabic                Labanese                      Arabic

I'm using using the melt and dcast functions of the reshape2 package, but it does not work. Does anyone know how to do this? Thanks.

df<-df[,c("OEN", "Langugae","MotherTongue", "SpokenatHome", "MainHomeLanguage")]
dfl <- melt(df, id.vars=c("OEN", "Langugae"), measure.vars=c("MotherTongue", "SpokenatHome", "MainHomeLanguage"),
            variable.name="Language")

dfw <- dcast(dfl, OEN ~  Langugae , value.var="value" )
Anthony
  • 67
  • 1
  • 6
  • 2
    Please add your code, even if it doesn't work. – zx8754 Jun 02 '14 at 13:56
  • I think your question similar wit this link http://stackoverflow.com/questions/5890584/reshape-data-from-long-to-wide-format-r but i dont know it will work for your data or no – rischan Jun 02 '14 at 14:07

3 Answers3

3

Hi you can try this (nonetheless it rely on plyr for indexing language) :

df <- read.table(text="ID Language
1    English
1    French
1    Spanish
1    Polish
2    English
2    French
3    Lebanese
3    Arabic", header=T)

# For creating an index of Language by ID (there is probably a better way to do this)
library(plyr)
df <- ddply(df, .(ID), mutate, ID2 =  1:length(ID)

# The same as above without using plyr :
df$ID2 <- unlist(tapply(X = df$ID, INDEX = df$ID, FUN = function(x) 1:length(x)))

# And use reshape for doing what you want
reshape(data = df, timevar = "ID2", v.names = "Language", idvar = "ID", direction = "wide")

#  ID Language.1 Language.2 Language.3 Language.4
#1  1    English     French    Spanish     Polish
#5  2    English     French       <NA>       <NA>
#7  3   Lebanese     Arabic       <NA>       <NA>

The same with your second dataset :

df2 <- read.table(text="ID Language  MotherTongue  SpokenatHome   HomeLang
1    English   English            NA      English
1    French       NA           French        NA
1    Polish    Polish         NA           NA
2    Lebanese  Lebanese        Lebanese    Labanese
2    Arabic       NA    NA             Arabbic", header=TRUE)

df2 <- ddply(df2, .(ID), mutate, ID2 =  1:length(ID))
reshape(data = df2, timevar = "ID2", v.names = c("Language", "MotherTongue", "SpokenatHome", "HomeLang"), idvar = "ID", direction = "wide")
Victorp
  • 13,636
  • 2
  • 51
  • 55
  • Thanks Victorp. It works for a much smaller dataset. – Anthony Jun 02 '14 at 17:23
  • Considering that it works only for a smaller dataset (100 rows) on my computer, do you think it's a good idea to split it into smaller pieces and then use rbind to combine them all together? Thanks – Anthony Jun 02 '14 at 19:41
  • 27000 is not big... try to close software using RAM and load in R only the dataset you need. – Victorp Jun 03 '14 at 07:49
  • I tried that. Rebooting my computer a few times, and running only Rstudio. – Anthony Jun 03 '14 at 13:07
  • When I run the scripts, the CPU monitor jumps from 20% to >90%, and then it stops. – Anthony Jun 03 '14 at 13:30
0

Not very elegant:

df <- read.table(header = TRUE, as.is = TRUE, text = '
                 ID Language
1    English
1    French
1    Spanish
1    Polish
2    English
2    French
3    Lebanese
3    Arabic')


# split by ID
sp <- tapply(df$Language, df$ID, function(x) x)
# max length
mls <- max(sapply(x, length))

# make same length
spNA <- lapply(sp, function(x) {
  l <- length(x)
  if(l == mls){
    out <- x
  } else {
    out <- c(x, rep(NA, mls-l))
  }
  return(out)
  }
)
# rbind
do.call(rbind, spNA)

# [,1]       [,2]     [,3]      [,4]    
# 1 "English"  "French" "Spanish" "Polish"
# 2 "English"  "French" NA        NA      
# 3 "Lebanese" "Arabic" NA        NA    
EDi
  • 13,160
  • 2
  • 48
  • 57
0

Here is a reshape2 solution. Like the other answers, I had to add in a variable to represent the number of answer within each ID which I did with ddply from plyr.

df = read.table(text="ID Language  MotherTongue  SpokenatHome   HomeLang
1    English   English            NA      English
1    French       NA           French        NA
1    Polish    Polish         NA           NA
2    Lebanese  Lebanese        Lebanese    Labanese
2    Arabic       NA              Arabbic NA", header=TRUE)

require(reshape2)
df1 = melt(df, id.vars=c("ID"), variable.name = "type")

require(plyr)
# Add in variable for number of unique  answers per ID
df1 = ddply(df1, .(ID, type), mutate, num = 1:length(ID))
# Cast the dataset wide
df2 = dcast(df1, ID ~ type + num)

This gives multiple columns for each category (Language, HomeLang, etc.). If you need to remove columns that contain all NA, you can do so as below (which I found here).

df2[colSums(is.na(df2)) < nrow(df2)]
Community
  • 1
  • 1
aosmith
  • 34,856
  • 9
  • 84
  • 118
  • Great! This gives me the format that I wanted, but it does not work for the full dataset (27770 rows and 8 columns). I have to reduce it to only 100 rows. As Victorp suggested, this might be a memory issue. My machine is an i5 @ 3.1Ghz, 4GB memory and is running Window 7. Just don't know why it takes a lot memory for this task. Thanks again for all your help – Anthony Jun 02 '14 at 17:21