5

Say I have a column with Id of a product and a list of data frames with characteristics about them:

bundle dataframe

 bundle
1  284993459
2 1048768805
3  511310430
4 1034630958
5 1235581326

d2 list

[[1]]
    id value
1   35   0.2
2 1462   0.2
3 1109   0.2
4  220   0.2
5  211   0.1

[[2]]
list()

[[3]]
    id name value
1  394        0.5
2 1462        0.5

[[4]]
    id name value
1  926        0.3
2 1462        0.3
3  381        0.3
4  930        0.2

[[5]]
    id name value
1  926        0.5
2 1462        0.5

I need to create columns with all characteristics ID and their values for each product.

bundle =  data.frame(bundle =  c(284993459,1048768805,511310430,1034630958,1235581326))
d2<- list(data.frame(id = c(35,1462,1109,220,211), value = c(0.2, 0.2, 0.2,0.2,0.1)), 
                    data.frame(id = NULL, value = NULL), 
                    data.frame(id = c(394,1462), value = c(0.5,0.5)),
                    data.frame(id = c(926,1462,381,930), value = c(0.3,0.3,0.3,0.2)),
                    data.frame(id = c(926,1462), value = c(0.5,0.5))) 

         bundle    35 1462 1109 220 211 394 1462
    1  284993459   0.2  0.2  0.2 0.2 0.1   0    0
    2 1048768805     0    0    0   0   0   0    0
    3  511310430     0    0    0   0   0 0.5  0.5  

Can't figure out how to do this. Had an idea to unlist this data frame list, but no good came of it, since a have more than 8000 prodict IDs:

for (i in seq(d2))
  assign(paste0("df", i), d2[[i]])  

If we take a different approach than I have to to join transposed characteristics data frames so the values are filled row by row.

zx8754
  • 52,746
  • 12
  • 114
  • 209

4 Answers4

2

Here's a tidyverse solution. First we add a bundle column to all data.frames and stitch them together using purr::map2_dfr , then use tidyr::spread to format as wide.

library(tidyverse)
res <- map2_dfr(bundle$bundle,d2,~mutate(.y,bundle=.x)) %>%
  spread(id,value,)
res[is.na(res)] <- 0
#       bundle  35 211 220 381 394 926 930 1109 1462
# 1  284993459 0.2 0.1 0.2 0.0 0.0 0.0 0.0  0.2  0.2
# 2  511310430 0.0 0.0 0.0 0.0 0.5 0.0 0.0  0.0  0.5
# 3 1034630958 0.0 0.0 0.0 0.3 0.0 0.3 0.2  0.0  0.3
# 4 1235581326 0.0 0.0 0.0 0.0 0.0 0.5 0.0  0.0  0.5
moodymudskipper
  • 46,417
  • 11
  • 121
  • 167
  • Gives me an error: _Error in UseMethod("mutate_"): no applicable method for 'mutate_' applied to an object of class "list"_ – Nikita Pronin May 28 '18 at 08:35
  • with your real data or with the sample data you provided ? – moodymudskipper May 28 '18 at 08:40
  • With the real data, although the real data has exactly the same look, but more rows – Nikita Pronin May 28 '18 at 08:43
  • You could use `class`, `summary`, `str` etc to diagnose the differences between your sample data and real data, if it works with your sample data there's not much more I can do. If you can build a minimal reproducible example from your real data that makes my solution fail I'll be happy to take a look. – moodymudskipper May 28 '18 at 08:47
  • https://github.com/nikitapronin/furry-parakeet Here is the original data. – Nikita Pronin May 28 '18 at 10:33
  • We really need a reproducible example posted in the question Nikita, these questions and answers are supposed to benefit to the next users. One suggestion: try to sample 10 row numbers and extract a sample from `bundle` and `d2` for those, do it with different seeds if necessary until you can reproduce the crash, then post it in your original question as an edit. – moodymudskipper May 28 '18 at 11:33
0

You can first add the bundle to each data.frame within the list, then pivot it using reshape2::dcast or data.table::dcast before updating NAs to 0

ans <- data.table::dcast(
        do.call(rbind, Map(function(nm, DF) within(DF, bundle <- nm), bundle$bundle, d2)),
    bundle ~ id)
ans[is.na(ans)] <- 0
ans

#      bundle  35 211 220 381 394 926 930 1109 1462
#1  284993459 0.2 0.1 0.2 0.0 0.0 0.0 0.0  0.2  0.2
#2  511310430 0.0 0.0 0.0 0.0 0.5 0.0 0.0  0.0  0.5
#3 1034630958 0.0 0.0 0.0 0.3 0.0 0.3 0.2  0.0  0.3
#4 1235581326 0.0 0.0 0.0 0.0 0.0 0.5 0.0  0.0  0.5

edit: adding more explanations after OP's comment

1) function(nm, DF) within(DF, bundle <- nm) takes the input data.frame DF and adds a new column called bundle with values equal to nm.

2) Map applies a function to the corresponding elements of given vectors. (see ?Map) That means that Map applies the above function using each of the bundle values and add them to each data.frame in d2

chinsoon12
  • 25,005
  • 4
  • 25
  • 35
0

Another approach could be

library(data.table)
library(tidyverse)

df <- rbindlist(
  lapply(lapply(d2, function(x) if(nrow(x)==0) data.frame(id=NA, value=NA) else x),  #in case there is no dataframe row in a list assign a blank dataframe
         function(y) y %>% spread(id, value)), #convert all dataframes in wide format
  fill = T) %>%                                #rbind all dataframe in a single dataframe
  select(-`<NA>`) %>%
  cbind.data.frame(bundle = bundle$bundle)

Output is:

    35 211 220 1109 1462 394 381 926 930     bundle
1: 0.2 0.1 0.2  0.2  0.2  NA  NA  NA  NA  284993459
2:  NA  NA  NA   NA   NA  NA  NA  NA  NA 1048768805
3:  NA  NA  NA   NA  0.5 0.5  NA  NA  NA  511310430
4:  NA  NA  NA   NA  0.3  NA 0.3 0.3 0.2 1034630958
5:  NA  NA  NA   NA  0.5  NA  NA 0.5  NA 1235581326

Sample data:

bundle <-  data.frame(bundle =  c(284993459,1048768805,511310430,1034630958,1235581326))
d2 <- list(data.frame(id = c(35,1462,1109,220,211), value = c(0.2, 0.2, 0.2,0.2,0.1)), 
           data.frame(id = NULL, value = NULL), 
           data.frame(id = c(394,1462), value = c(0.5,0.5)),
           data.frame(id = c(926,1462,381,930), value = c(0.3,0.3,0.3,0.2)),
           data.frame(id = c(926,1462), value = c(0.5,0.5))) 
Prem
  • 11,775
  • 1
  • 19
  • 33
  • Gives me an error: _Error in if (nrow(x) == 0) data.frame(id = NA, value = NA) else x : argument is of length zero_ – Nikita Pronin May 28 '18 at 08:38
  • Can you share `dput(d2)`? It runs perfectly fine on the sample data mentioned in your post (or the one which I have used in my answer). In the meantime you can try replacing `if(nrow(x)==0)` code in my answer with `if(nrow(x)==0 | is.null(x))`. – Prem May 28 '18 at 09:47
  • github.com/nikitapronin/furry-parakeet Here is the original data. – Nikita Pronin May 28 '18 at 10:44
  • This seems to be a raw data. I would suggest to upload the processed data which you have posted in your sample using `dput(head(d2)`. You may want to see his [link](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) as well. – Prem May 28 '18 at 11:09
0

There are two possible approaches which differ only in the sequence of operations:

  1. Reshape all dataframes in the list individually from long to wide format and rbind() matching columns.
  2. rbind() all dataframes in long form and reshape to wide format afterwards.

Both approaches require to include bundle somehow.

For the sake of completeness, here are different implementations of the second approach using data.table.

library(data.table)
library(magrittr)
d2 %>% 
  # bind row-wise into large data.table, create id column
  rbindlist(idcol = "bid") %>% 
  # right join to append bundle column
  setDT(bundle)[, bid := .I][., on = "bid"] %>%
  # reshape from long to wide format
  dcast(., bundle ~ id, fill = 0)
       bundle  35 211 220 381 394 926 930 1109 1462
1:  284993459 0.2 0.1 0.2 0.0 0.0 0.0 0.0  0.2  0.2
2:  511310430 0.0 0.0 0.0 0.0 0.5 0.0 0.0  0.0  0.5
3: 1034630958 0.0 0.0 0.0 0.3 0.0 0.3 0.2  0.0  0.3
4: 1235581326 0.0 0.0 0.0 0.0 0.0 0.5 0.0  0.0  0.5

Here, piping is used just to visualize the sequence of function calls. With data.table's chaining the statement becomes more concise:

library(data.table) # library(magrittr) not required
setDT(bundle)[, bid := .I][
  rbindlist(d2, id = "bid"), on = "bid"][, dcast(.SD, bundle ~ id, fill = 0)]

or

library(data.table) # library(magrittr) not required
dcast(setDT(bundle)[, bid := .I][
  rbindlist(d2, id = "bid"), on = "bid"], bundle ~ id, fill = 0)

Another variant is to rename the list elements before calling rbindlist() which will take the names for creating the idcol:

library(data.table)
library(magrittr)
d2 %>% 
  # rename list elements
  setNames(bundle$bundle) %>%
  # bind row-wise into large data.table, create id column from element names
  rbindlist(idcol = "bundle") %>% 
  # convert bundle from character to factor to maintain original order
  .[, bundle := forcats::fct_inorder(bundle)] %>%
  # reshape from long to wide format
  dcast(., bundle ~ id, fill = 0)
       bundle  35 211 220 381 394 926 930 1109 1462
1:  284993459 0.2 0.1 0.2 0.0 0.0 0.0 0.0  0.2  0.2
2:  511310430 0.0 0.0 0.0 0.0 0.5 0.0 0.0  0.0  0.5
3: 1034630958 0.0 0.0 0.0 0.3 0.0 0.3 0.2  0.0  0.3
4: 1235581326 0.0 0.0 0.0 0.0 0.0 0.5 0.0  0.0  0.5

Note that the variants presented so far have skipped the empty dataframe which belongs to bundle 1048768805 (likewise the answers by Moody_Mudskipper and chinsoon12).

In order to keep the empty dataframe in the final result, the order of the join has to be changed so that all rows of bundle will be kept:

library(data.table)
dcast(
  rbindlist(d2, id = "bid")[setDT(bundle)[, bid := .I], on = "bid"], 
  bundle ~ id, fill = 0
  )[, "NA" := NULL][]
       bundle  35 211 220 381 394 926 930 1109 1462
1:  284993459 0.2 0.1 0.2 0.0 0.0 0.0 0.0  0.2  0.2
2:  511310430 0.0 0.0 0.0 0.0 0.5 0.0 0.0  0.0  0.5
3: 1034630958 0.0 0.0 0.0 0.3 0.0 0.3 0.2  0.0  0.3
4: 1048768805 0.0 0.0 0.0 0.0 0.0 0.0 0.0  0.0  0.0
5: 1235581326 0.0 0.0 0.0 0.0 0.0 0.5 0.0  0.0  0.5

Or, if the exact order of bundle is to be maintained:

library(data.table)
dcast(
  rbindlist(d2, id = "bid")[setDT(bundle)[, bid := .I], on = "bid"], 
  bid + bundle ~ id, fill = 0
)[, c("bid", "NA") := NULL][]
       bundle  35 211 220 381 394 926 930 1109 1462
1:  284993459 0.2 0.1 0.2 0.0 0.0 0.0 0.0  0.2  0.2
2: 1048768805 0.0 0.0 0.0 0.0 0.0 0.0 0.0  0.0  0.0
3:  511310430 0.0 0.0 0.0 0.0 0.5 0.0 0.0  0.0  0.5
4: 1034630958 0.0 0.0 0.0 0.3 0.0 0.3 0.2  0.0  0.3
5: 1235581326 0.0 0.0 0.0 0.0 0.0 0.5 0.0  0.0  0.5
Uwe
  • 41,420
  • 11
  • 90
  • 134
  • I have an error: _Error in rbindlist(., idcol = "bundle") : Item 3 has 3 columns, inconsistent with item 1 which has 2 columns. If instead you need to fill missing columns, use set argument 'fill' to TRUE._ – Nikita Pronin May 28 '18 at 08:29
  • Apparently, not all of the dataframes in the list have the same structure (same number, order, and type of columns). Suggested solutions: (1) Either you can fix the input data or (2) a special treatment for the deviating dataframes has to be included in the code. – Uwe May 29 '18 at 12:10