1

I've a list of 11,383 data frames. I need to merge them into one big data frame, but the have different columns (2,3,4 columns) so when i use rbind_all from Dplyr i get not desired result.

One way around would be to rbind data frames that have the same number of columns (the have different headers, but i don't mind about them). As i have data frames with 2,3 and 4 columns, it would result in 3 big data frames according to the number of columns of every single data frame in the list.

Expected Output:

Data frame with 4 columns:

SKU             Tv y Video  Tecnología  Deportes
2003091070002P  Tv y Video  Tecnología  Deportes
2.00E+12        Tv y Video  Tecnología  Deportes
2003120060008P  Tv y Video  Tecnología  Deportes
2004121460080P  Cómputo     Tecnología  Decohogar
2.00G+12        Cómputo     Tecnología  Decohogar
2004121440802P  Cómputo     Tecnología  Decohogar
2.00A+12        Cómputo     Tecnología  Decohogar

Data frame with 2 columns:

            SKU         PROMOCIONES
1   110 2089060010006P  PROMOCIONES
2   111 2089660010006P  PROMOCIONES
#

This is my code:

df_2col <- data.frame()  #Starts Data frame for dfs with 2 columns

df_3col <- data.frame()  #Starts Data frame for dfs with 3 columns

df_4col <- data.frame()  #Starts Data frame for dfs with 4 columns


lapply(my_list, function(i){
    if (ncol(i) == 2)
        df_2col <- rbind(i)
    ifelse (ncol(i) == 3)
        df_3col <- rbind(i)
    ifelse (ncol(i) == 4)
        df_4col <- rbind(i)
})

But i get this error:

Error in ifelse(ncol(i) == 3) : argument "no" is missing, with no default 

A list example of my data:

list(list(structure(list(SKU = "2079230130006P", Decohogar = "Decohogar", 
    Para.la.Mesa = "Para.la.Mesa", Copas.y.Vasos = "Copas.y.Vasos"), .Names = c("SKU", 
"Decohogar", "Para.la.Mesa", "Copas.y.Vasos"), class = "data.frame", row.names = 134L)), 
    list(structure(list(SKU = "2079240080001P", Decohogar = "Decohogar", 
        Para.la.Mesa = "Para.la.Mesa", Copas.y.Vasos = "Copas.y.Vasos"), .Names = c("SKU", 
    "Decohogar", "Para.la.Mesa", "Copas.y.Vasos"), class = "data.frame", row.names = 132L)), 
    list(structure(list(SKU = "2069060020005P", PROMOCIONES = "PROMOCIONES"), .Names = c("SKU", 
    "PROMOCIONES"), class = "data.frame", row.names = 111L)), 
    list(structure(list(SKU = "2047121452095P", Dormitorio = "Dormitorio", 
        Colchones = "Colchones", X2.plazas = "X2.plazas"), .Names = c("SKU", 
    "Dormitorio", "Colchones", "X2.plazas"), class = "data.frame", row.names = 223L)), 
    list(structure(list(SKU = "2069060010006P", PROMOCIONES = "PROMOCIONES"), .Names = c("SKU", 
    "PROMOCIONES"), class = "data.frame", row.names = 110L)), 
    list(structure(list(SKU = "2069060010006P", PROMOCIONES = "PROMOCIONES"), .Names = c("SKU", 
    "PROMOCIONES"), class = "data.frame", row.names = 109L)))

NOTE: This would work when you know the number of columns for every single data frame in the list. Is there a way of doing this dinamycaly? I mean, if in the future, there is a data frame with 5 columns, the code should return also a big data frame with 5 columns for this data frames.

Omar Gonzales
  • 3,806
  • 10
  • 56
  • 120
  • Do they have intersecting column names? – akrun Feb 24 '15 at 16:29
  • @akrun, Yes, the have a commun column: "SKU", its the first column in every single data frame. I've put some data, so you can see how it is. Thanks. – Omar Gonzales Feb 24 '15 at 16:31
  • Try `library(data.table);rbindlist(do.call(`c`, my_list), fill=TRUE)` – akrun Feb 24 '15 at 16:34
  • Or maybe merge them? `Reduce(function(...) merge(..., all=T), my_list)` – talat Feb 24 '15 at 16:36
  • @akrun, With data.table got: `Error in rbindlist(do.call(c, my_list), fill = TRUE) : Item 1 of list input is not a data.frame, data.table or list` – Omar Gonzales Feb 24 '15 at 16:38
  • Based on the dput showed, I am getting `8` columns with no error message. I am using `data.table_1.9.5` – akrun Feb 24 '15 at 16:39
  • @OmarGonzales I posted the output as an answer. Did you get the error message on the dput dataset or on something else? – akrun Feb 24 '15 at 16:46
  • @akrun, that is not what i'm lookin for. I don't want any NAs values, i think this happens cause the dataframes have different columns. This is way i think, it's better to separate them in new data frames according their ncol(). See my code, for the if else part. – Omar Gonzales Feb 24 '15 at 16:47
  • Oh Ok, Can you post the expected output also based on the my_list. I was looking at docendodiscimus code also which gave the same result. Sorry. – akrun Feb 24 '15 at 16:48
  • @OmarGonzales Based on your code, it is rbinding all those list elements with `3` columns, also with 4 columns separately without checking the column names. Here, in your `my_list`, there are instances where the 4 column elements have different column names. What should we do iin that case? – akrun Feb 24 '15 at 16:52
  • I don't mind about the columns names. As you see, the col names are the same in the rows. I just need to combine them, and maybe use random colnames. I put some desired output, that i hope will make my point clearer. – Omar Gonzales Feb 24 '15 at 16:55
  • Here is what I find confusing `SKU Decohogar Para.la.Mesa Copas.y.Vasos` and `SKU Dormitorio Colchones X2.plazas` Ok, I got it. it is not relevant.. – akrun Feb 24 '15 at 16:56
  • Why would you want to merge all data.frames in a big data.frame, when they have a different number of columns? You should merge them into a list. My 2 cents – codingEnthusiast Feb 24 '15 at 16:58
  • @akrun, think of it as 4 columns: SKU, Category, SubCat, Product. So, in Cat you would have: "Decohogar", and "Dormitorio". In SubCat, "Para.la.Mesa" and "Colchones", and in "Product", "Copas.y.Vasos", and "X2.plazas". Thanks. – Omar Gonzales Feb 24 '15 at 17:01
  • @OmarGonzales I updated the solution. please check if that helps – akrun Feb 24 '15 at 17:01
  • @naltipar, i don't want to do that, read the tittle of the question: in different data frames, according to the number of columns of every single data frame within the list. – Omar Gonzales Feb 24 '15 at 17:02
  • @OmarGonzales can you provide the example that reproduce the error? – akrun Feb 24 '15 at 18:20
  • @akrun, this is the error when i run your code: `Error in split.default(my_list1, indx) : group length is 0 but data length > 0` . don't know which exact part is causing it. – Omar Gonzales Feb 24 '15 at 18:29
  • @OmarGonzales I was able to replicate the error by inserting a NA list element. Updated the solution, please check if that helps – akrun Feb 24 '15 at 18:38

2 Answers2

2

We could flatten the list elements do.call(c,..) get the number of columns (ncol) for each list element ("indx"), use this to split the list, rbindlist the resulting elements.

library(data.table)
my_list1 <- do.call(`c`, my_list)
indx <- sapply(my_list1, ncol)
lst <- lapply(split(my_list1, indx), rbindlist)
lst
#$`2`
#              SKU PROMOCIONES
#1: 2069060020005P PROMOCIONES
#2: 2069060010006P PROMOCIONES
#3: 2069060010006P PROMOCIONES

#$`4`
#             SKU  Decohogar Para.la.Mesa Copas.y.Vasos
#1: 2089230130006P  Decohogar Para.la.Mesa Copas.y.Vasos
#2: 2089240080001P  Decohogar Para.la.Mesa Copas.y.Vasos
#3: 2047121452095P Dormitorio    Colchones     X2.plazas

If we need to get separate data.frame objects (not recommended), use list2env

 list2env(setNames(lst, paste0('dat',seq_along(lst))), envir=.GlobalEnv)

Update

If there are NULL or NA values as one of the list elements, we could get this error

my_list1[[7]] <- NA
split(my_list1, sapply(my_list1, ncol))
#Error in split.default(my_list1, sapply(my_list1, ncol)) : 
 #group length is 0 but data length > 0

Then, we could check whether the elements are data.frame ("isDF"), subset the list and get the "ncol", and do as before.

isDF <- sapply(my_list1, is.data.frame)
indx <- sapply(my_list1[isDF], ncol)
lapply(split(my_list1[isDF], indx), rbindlist)
akrun
  • 874,273
  • 37
  • 540
  • 662
  • Got this: `Error in split.default(my_listfinal, indx) : group length is 0 but data length > 0` It works nice with the data provided, but maybe some data frame within the list is giving me trouble. Some of them have 0 rows... May you add some explanation for every line of the solution. Thanks. – Omar Gonzales Feb 24 '15 at 17:10
  • @OmarGonzales It would be better if you can post the dput of a small example that reproduce the error as the example you showed works here. – akrun Feb 24 '15 at 17:12
  • got this error this time: `Error in args[[i]] : attempt to select less than one element`, when running this line: `lst <- lapply(split(my_list1[isDF], indx), rbindlist)` – Omar Gonzales Feb 24 '15 at 18:57
  • @OmarGonzales It must be something connected with the NULL elements. How big is your list? Also, test this in a very small subset of your list and if the error persists, then please dput that subset – akrun Feb 24 '15 at 19:01
  • i think, i should have clear my objects (rm = list = ls()). I don't get problems with your code, but get this message: **Error in data.frame(`2` = list(SKU = c("2003120060006P", "2003120060006P", : arguments imply differing number of rows: 3939, 2563, 4786, 95```** – Omar Gonzales Feb 24 '15 at 19:28
  • @OmarGonzales When do you get this error? While writing it to file?? Please check http://stackoverflow.com/questions/23413728/converting-rdata-files-to-csv-error-in-data-frame-arguments-imply-differing-nu – akrun Feb 24 '15 at 19:33
  • yes i was using: write.csv. I've tried that link and got: `Error in lst[[i]] : invalid subscript type 'list'`. Maay you update the answer, to include the method to export those data frames as CSV files? – Omar Gonzales Feb 24 '15 at 19:45
  • @OmarGonzales If you want to save it as different files `lst1 <- lapply(split(my_list1[isDF], indx), rbindlist); lapply(seq_along(lst1), function(i) write.csv(lst1[[i]], file=paste0('file','i', '.csv')` – akrun Feb 25 '15 at 04:16
0

rbind_all has a fill function that replaces blank data with NA. When I tried to use rbind the first time I got an error because the data set you have provided is a list of list not a list of dataframes. I first converted the nested lists to dataframes then used rbind to create the single dataframe.

your example data is "x"

sapply(x, class)
x2 <- lapply(x, as.data.frame)
x3 <- rbind_all(x2)

I don't know if you are making a tidy data frame but if you are you could use tidyr to help

library(tidyr)
x4 <- gather(x3, key=SKU, na.rm=TRUE)[,-2]

UPDATE

The question is solved now but here is my update to my solution to make a list of dataframes of individual catagories

x5 <- lapply(unique(x4$value), function(n) filter(x4, value == n))
Jonno Bourne
  • 1,931
  • 1
  • 22
  • 45