
I'm looking for a dplyr-ish solution to the following task. I have a data frame containing a variable that is a list of lists, each with a dimnames attribute. The lists are of different lengths. Here's the output of str(df):

Classes ‘tbl_df’, ‘tbl’ and 'data.frame':   3 obs. of  2 variables:
 $ Step : int  1 2 3
 $ Value:List of 3
  ..$ : num [1:2, 1:2] 0.232 0.261 0.932 0.875
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr  "4" "5"
  .. .. ..$ : chr  "0.2" "0.094"
  ..$ : num [1:2, 1:5] 0.197 0.197 0.64 0.643 0.958 ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr  "4" "5"
  .. .. ..$ : chr  "0.2" "0.094" "0.044" "0.021" ...
  ..$ : num [1:2, 1] 0.268 0.262
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr  "4" "5"
  .. .. ..$ : chr "0.2"

I've included dput code below to recreate this dataframe.

I want a dataframe in the following format:

Step    Value   a     b
 1      0.232   4   0.200
 1      0.261   5   0.200
 1      0.932   4   0.094
 1      0.875   5   0.094
 1       NA     4   0.044
 1       NA     5   0.044
 1       NA     4   0.021
 1       NA     5   0.021
 1       NA     4   0.010
 1       NA     5   0.010
 2      0.197   4   0.200
 2      0.197   5   0.200
 2      0.640   4   0.094
 2      0.643   5   0.094
 2      0.958   4   0.044
 2      1.032   5   0.044
 2      0.943   4   0.021
 2      1.119   5   0.021
 2      0.943   4   0.010
 2      1.119   5   0.010
 3      0.268   4   0.200
 3      0.262   5   0.200
 3       NA     4   0.094
 3       NA     5   0.094
 3       NA     4   0.044
 3       NA     5   0.044
 3       NA     4   0.021
 3       NA     5   0.021
 3       NA     4   0.010
 3       NA     5   0.010

where a holds the row names from each element's dimnames and b holds the column names.

I've tried a for loop to separate out each list by step, but

  1. I've not been successful in padding out the list with NAs (length(x) <- y doesn't work).

  2. I've reviewed advanced R data types but haven't been successful in extracting the dimnames into vectors to use as dataframe columns (attr(df$Value, "dimnames") yields NULL.)

Once I have lists of the same length I can construct the new dataframe vectors step by step in the for loop and then rbind. Or is there a way to use the dimname attribute to directly construct a wide dataframe using both row and column dimnames as dataframe column names? I can then gather to make a long dataframe.

There are several subquestions here, and I'm sure there's a more elegant solution than the one I've mapped out. Thanks for looking.

Here's the dput code to create the dataframe:

df <- structure(list(Step = c(1L, 2L, 3L), Value = list(structure(c(0.232, 
0.261, 0.932, 0.875), .Dim = c(2L, 
2L), .Dimnames = list(c("4", "5"), c("0.2", "0.094"
))), structure(c(0.197, 0.197, 0.640, 
0.643, 0.958, 1.032, 0.943, 
1.119, 0.943, 1.119), .Dim = c(2L, 
5L), .Dimnames = list(c("4", "5"), c("0.2", "0.094", 
"0.044", "0.021", "0.01"))), structure(c(0.268, 
0.262), .Dim = c(2L, 1L), .Dimnames = list(c("4", 
"5"), "0.2")))), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-3L), .Names = c("Step", "Value"))
user438383
zazizoma
  • Looks like that variable is a list of matrices, not a list of lists? – Axeman Jun 01 '17 at 17:24
  • Could be, `str(df$Value[[1]])` yields `num [1:2, 1:2] 0.232 0.261 0.932 0.875 - attr(*, "dimnames")=List of 2 ..$ : chr [1:2] "4" "5" ..$ : chr [1:2] "0.2" "0.094"`, but then isn't a matrix a list with dimensional attributes, and I still need to convert to a long dataframe using the dimnames as variables. I can't figure a way to untangle this object. – zazizoma Jun 01 '17 at 17:37
  • Yeah ok, `num [1:2, 1:2]` means a numerical array with two dimensions, i.e. a matrix. It's only the dimnames that are a list. – Axeman Jun 01 '17 at 17:39
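
To make the point from the comments concrete, here is a minimal base R sketch. It assumes only the first matrix from the question's dput output; `value_list` is a hypothetical stand-in for `df$Value`. It suggests why `attr(df$Value, "dimnames")` comes back `NULL`: the dimnames belong to each matrix, not to the list that holds them.

```r
# First matrix from the question's dput() output
m <- matrix(c(0.232, 0.261, 0.932, 0.875), nrow = 2,
            dimnames = list(c("4", "5"), c("0.2", "0.094")))
value_list <- list(m)          # hypothetical stand-in for df$Value

is.list(m)                     # FALSE: a matrix is an atomic vector with a dim attribute
is.matrix(m)                   # TRUE
attr(value_list, "dimnames")   # NULL: the list itself carries no dimnames
dimnames(value_list[[1]])      # the dimnames live on each element

row_names <- rownames(m)       # row labels of one element
col_names <- colnames(m)       # column labels of one element
```

Applying `rownames()` or `colnames()` element by element (for example with `lapply`) is therefore the way to reach the labels.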

2 Answers


Approach one:

First, we convert the matrices to data.frames, then we add the rownames as a separate column called a, and gather the remaining columns into long format. By unnesting we get one big data.frame. Adding in the NA values is easy with complete, which fills in every missing combination of Step, a and b.

library(tidyverse) # using dplyr, tidyr and purrr

df %>% 
  mutate(Value = map(Value, as.data.frame),
         Value = map(Value, rownames_to_column, 'a'),
         Value = map(Value, ~gather(., b, value, -a))) %>% 
  unnest(Value) %>% 
  complete(Step, a, b)
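
As a rough illustration of what the per-matrix steps do, the sketch below reshapes a single matrix from the question. It swaps in `pivot_longer()` for `gather()`, which still works but is marked as superseded in tidyr's documentation. The names `wide` and `long` are introduced here for illustration only.

```r
library(tibble)
library(tidyr)

# First matrix from the question's dput() output
m <- matrix(c(0.232, 0.261, 0.932, 0.875), nrow = 2,
            dimnames = list(c("4", "5"), c("0.2", "0.094")))

# Row names become column `a`; the former column names end up in `b`
wide <- rownames_to_column(as.data.frame(m), "a")
long <- pivot_longer(wide, cols = -a, names_to = "b", values_to = "value")
long
```

Each row of `long` pairs one matrix cell with its row label and column label, which is the shape that `unnest()` then stacks across steps.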

Approach two:

Manually define the data.frame, then do the same:

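df %>% 
  mutate(Value = map(Value, 
                     # c(.) unrolls the matrix column by column, so the row names
                     # cycle fastest and each column name repeats nrow(.) times
                     ~tibble(value = c(.), 
                             a = rep(rownames(.), times = ncol(.)),
                             b = rep(colnames(.), each = nrow(.))))) %>% 
  unnest(Value) %>% 
  complete(Step, a, b)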
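
When labels are built by hand like this, they have to follow the order in which `c()` flattens a matrix. A small base R check, using the first matrix from the question (the variable names are illustrative), shows that order:

```r
m <- matrix(c(0.232, 0.261, 0.932, 0.875), nrow = 2,
            dimnames = list(c("4", "5"), c("0.2", "0.094")))

value <- c(m)                            # column 1 first, then column 2
a <- rep(rownames(m), times = ncol(m))   # "4" "5" "4" "5"
b <- rep(colnames(m), each = nrow(m))    # "0.2" "0.2" "0.094" "0.094"

# Every (a, b) pair should index back to the value it sits next to
ok <- all(mapply(function(v, r, k) m[r, k] == v, value, a, b))
ok
```

Because R stores matrices column by column, the row labels repeat as a whole block (`times`) while each column label repeats once per row (`each`).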

Result:

Both give:

# A tibble: 30 × 4
    Step     a     b value
   <int> <chr> <chr> <dbl>
1      1     4  0.01    NA
2      1     4 0.021    NA
3      1     4 0.044    NA
4      1     4 0.094 0.932
5      1     4   0.2 0.232
6      1     5  0.01    NA
7      1     5 0.021    NA
8      1     5 0.044    NA
9      1     5 0.094 0.875
10     1     5   0.2 0.261
# ... with 20 more rows
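
The NA rows in this output come from `complete()`. A toy sketch of its behaviour follows; the data frame `toy` is made up for illustration and is not part of the question's data.

```r
library(tidyr)

# Step 2 has no row for b = "0.094"
toy <- data.frame(Step  = c(1L, 1L, 2L),
                  b     = c("0.2", "0.094", "0.2"),
                  value = c(0.232, 0.932, 0.197))

# complete() adds a row for every missing Step/b combination and fills value with NA
filled <- complete(toy, Step, b)
filled
```

The same mechanism pads steps 1 and 3 above with the columns that only step 2 has.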
Axeman
  • 32,068
  • 8
  • 81
  • 94
  • I'm speechless, so quickly and not one but two options. Let me play with these and I'll get back to you. I'm sure I'll have questions. – zazizoma Jun 01 '17 at 17:52
  • That does it, and fast with my real dataframe. I had no idea it would be THIS easy and I've picked up several nifty tricks. I've used map with functions, but not with value setting before, and I'd not seen complete. What does the ~ do? – zazizoma Jun 01 '17 at 18:12
  • It's one of the ways to define an anonymous function in `purrr`. See the `.f` argument of `?map`. – Axeman Jun 01 '17 at 20:58
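
For readers with the same question about `~`, a minimal sketch of the equivalence, assuming purrr is installed (the list `x` is illustrative):

```r
library(purrr)

x <- list(1, 2, 3)

# Formula shorthand: `.x` (or `.`) stands for the current element
with_formula  <- map(x, ~ .x * 2)
# The same mapping written as an ordinary anonymous function
with_function <- map(x, function(el) el * 2)

same <- identical(with_formula, with_function)
same
```

purrr converts the one-sided formula into a function behind the scenes, so the two calls should return the same list.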

Not really a dplyr solution, but you could do:

## l is the data frame from the question (called df there)
l = df
## Get the maximum length in l$Value and the index where it is observed
m = max(lengths(l$Value))
# [1] 10
j = which.max(lengths(l$Value))
# [1] 2

Then construct a dataframe for each element of l$Value, rbind them together and add the Step column:

## x[1:m] pads the shorter matrices with NA. Values unroll column by column, so
## each column name is repeated once per row (each = nrow(x)) to stay aligned.
l2 = lapply(l$Value, function(x) data.frame(a = rep(row.names(x), length.out = m),
                                            Value = x[1:m],
                                            b = rep(colnames(l$Value[[j]]), each = nrow(x), length.out = m)))
df = do.call(rbind, l2)
df$Step = rep(l$Step, each = m)

This returns:

   a Value     b Step
1  4 0.232   0.2    1
2  5 0.261   0.2    1
3  4 0.932 0.094    1
4  5 0.875 0.094    1
5  4    NA 0.044    1
6  5    NA 0.044    1
7  4    NA 0.021    1
8  5    NA 0.021    1
9  4    NA  0.01    1
10 5    NA  0.01    1
11 4 0.197   0.2    2
12 5 0.197   0.2    2
13 4 0.640 0.094    2
14 5 0.643 0.094    2
15 4 0.958 0.044    2
16 5 1.032 0.044    2
17 4 0.943 0.021    2
18 5 1.119 0.021    2
19 4 0.943  0.01    2
20 5 1.119  0.01    2
21 4 0.268   0.2    3
22 5 0.262   0.2    3
23 4    NA 0.094    3
24 5    NA 0.094    3
25 4    NA 0.044    3
26 5    NA 0.044    3
27 4    NA 0.021    3
28 5    NA 0.021    3
29 4    NA  0.01    3
30 5    NA  0.01    3
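
The padding in this approach relies on how base R indexing behaves past the end of a vector. A small sketch using the third matrix from the question (the target length 6 and the names `padded` and `labels` are chosen only for illustration):

```r
# Third matrix from the question's dput() output: 2 rows, 1 column
m <- matrix(c(0.268, 0.262), nrow = 2,
            dimnames = list(c("4", "5"), "0.2"))

# Indexing a matrix with a single vector treats it as a plain vector;
# positions beyond its length come back as NA
padded <- m[1:6]

# rep() with length.out recycles the row labels to the same length
labels <- rep(rownames(m), length.out = 6)

data.frame(a = labels, Value = padded)
```

This is one way to get equal-length pieces before calling `rbind`, at the cost of having to rebuild the row and column labels by hand.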
Lamia
  • 3,845
  • 1
  • 12
  • 19