0

Say I have 5 dataframes with identical columns but different row lengths. I want to make 1 dataframe that takes a specific column from each of the 5 dataframes, and fills in with NA's (or whatever) where there isn't a length match. I've seen questions on here that show how to do this with one-off vectors, but I'm looking for a way to do it with bigger sets of data.

Ex: 2 dataframes of equal length:

long <- data.frame(accepted = rnorm(350, 2000), cost = rnorm(350,5000))
long2 <- data.frame(accepted = rnorm(350, 2000), cost = rnorm(350,5000))

I can create a list that combines them, then create an empty dataframe and populate it with a common variable from the dataframes in the list:

list1 <- list(long, long2)
df1 <- as.data.frame(matrix(0, ncol = 5, nrow = 350))
df1[,1:2] <- sapply(list, '[[', 'accepted')

And it works.

But when I have more dataframes of unequal length, this approach fails:

long <- data.frame(accepted = rnorm(350, 2000), cost = rnorm(350,5000))
long2 <- data.frame(accepted = rnorm(350, 2000), cost = rnorm(350,5000))
medlong <- data.frame(accepted = rnorm(300, 2000), cost = rnorm(300,5000))
medshort <- data.frame(accepted = rnorm(150, 2000), cost = rnorm(150,5000))
short <- data.frame(accepted = rnorm(50, 2000), cost = rnorm(50,5000))

Now making the list and combined dataframe:

list2 <- list(long, long2, medlong, medshort, short)
df2 <- as.data.frame(matrix(0, ncol = 5, nrow = 350))
df1[,1:5] <- sapply(list, '[[', 'accepted')

I get the error about size mismatch:

Error in [<-.data.frame(*tmp*, , 1:5, value = c(1998.77096640377, : replacement has 700 items, need 1750

The only solution I've found to populating this dataframe with columns of unequal length from other dataframes is something along the lines of:

combined.df <- as.data.frame(matrix(0, ncol = 5, nrow = 350))
combined.df[,1] <- long[,2]
combined.df[,2] <- c(medlong[,2], rep(NA, nrow(long) - nrow(medlong))

But there's got to be a more elegant and faster way to do it... I know I'm missing something huge conceptually here

zero323
  • 322,348
  • 103
  • 959
  • 935
Marc Tulla
  • 1,751
  • 2
  • 20
  • 34

3 Answers3

2

One way would be to find the length of the longest column and then concatenate shorter columns with the appropriate number of NAs. One way would be like this (with data of a more reasonable size for a MWE!)...

out <- lapply(  list1 , '[[', 'accepted')

#  Find length of longest column
len <- max( sapply( out , length ) )

# Stack shorter columns with NA at the end
dfs <- sapply( out , function(x) c( x , rep( NA , len - length(x) ) ) ) 

#  Make data.frame and set column names at same time
setNames( do.call( data.frame , dfs ) , paste0("V" , 1:length(out) ) )
          V1         V2          V3
1 -1.0913212 -2.4864497  0.04220331
2 -0.5252874  0.8030984  0.21774515
3  0.6914167  0.9685629  1.47159957
4         NA         NA -0.89809670
5         NA         NA  0.51140539
6         NA         NA -0.46833136
7         NA         NA -0.40085707
Simon O'Hanlon
  • 58,647
  • 14
  • 142
  • 184
1

You can try using merge:

long$rn <- rownames(long)
long2$rn <- rownames(long2)
medlong$rn <- rownames(medlong)
medshort$rn <- rownames(medshort)
short$rn <- rownames(short)

result <- (merge(merge(merge(merge(
    long, long2[, cols], by=c('rn'), all=T),
    medlong[, cols], by=c('rn'), all=T),
    medshort[, cols], by=c('rn'), all=T),
    short[, cols], by=c('rn'), all=T))
zero323
  • 322,348
  • 103
  • 959
  • 935
1

You could, also, "subset" each dataframe like df[nrow(df) + n,] in order to insert NAs:

#dataframes of different rows
long <- data.frame(accepted = rnorm(15, 2000), cost = rnorm(15,5000))
long2 <- data.frame(accepted = rnorm(10, 2000), cost = rnorm(10,5000))
long3 <- data.frame(accepted = rnorm(12, 2000), cost = rnorm(12,5000))

#insert all dataframes in list to manipulate
myls <- list(long, long2, long3)

#maximum number of rows
max.rows <- max(nrow(long), nrow(long2), nrow(long3))

#insert the needed `NA`s to each dataframe
new_myls <- lapply(myls, function(x) { x[1:max.rows,] })

#create  wanted dataframe
do.call(cbind, lapply(new_myls, `[`, "accepted"))

#   accepted accepted accepted
#1  2001.581 1999.014 2001.810
#2  2000.071 2000.033 2000.588
#3  1999.931 2000.188 2000.833
#4  1998.467 1999.891 1997.645
#5  2000.682 2000.144 1999.639
#6  1999.693 1999.341 1998.959
#7  2000.222 1998.939 2002.271
#8  1999.104 1998.530 1997.600
#9  1998.435 2001.496 2001.129
#10 1998.160 2000.729 2001.602
#11 1999.267       NA 1999.733
#12 2000.048       NA 2001.431
#13 1999.504       NA       NA
#14 2000.660       NA       NA
#15 2000.160       NA       NA
alexis_laz
  • 12,884
  • 4
  • 27
  • 37
  • This is great, thanks Alexis. I'm not sure I understand the do.call functionality yet but everything else makes sense – Marc Tulla Oct 21 '13 at 14:53
  • if you run `lapply(new_myls, `[`, "accepted")` you'll see that is a list. In this specific case, `do.call` calls `cbind` on the elements of the list. – alexis_laz Oct 21 '13 at 14:56