6

I'm trying to put some matrices in a dataframe in R, something like:

m <- matrix(c(1,2,3,4), nrow=2, ncol=2)
df <- data.frame(id=1, mat=m)

But when I do that, I get a dataframe with 2 rows and 3 columns instead of a dataframe with 1 row and 2 columns.

According to the documentation, I have to protect my matrix with I():

df <- data.frame(id=1, mat=I(m))

str(df)
'data.frame':   2 obs. of  2 variables:
 $ id : num  1 1
 $ mat: AsIs [1:2, 1:2] 1 2 3 4

As I understand it, the dataframe contains one row for each row of the matrix, and the mat field is a list of matrix column values.

So how can I obtain a dataframe that contains matrices?

Thanks!

Ben Bolker
Scharron
  • Despite my answer, I have some sympathy with the other respondent: why do you want to do this? Perhaps we can help you find a better R idiom for doing it ... – Ben Bolker May 26 '11 at 22:20
  • I have data with inputs and outputs being matrices. I wanted each experience to be a row of a dataframe. – Scharron May 27 '11 at 10:13
  • Recent advances in the tidyverse family of packages, particularly purrr, make it useful to create nested columns of arbitrary data types for the purpose of functional programming. Nested columns of matrices may be useful as a preparatory step for transforming each matrix into a simpler structure. – David Bruce Borenstein Apr 13 '17 at 20:23

6 Answers

7

I find data.frames containing matrices mind-bendingly weird, but the only way I know to achieve this is hidden in stats:::simulate.lm.

Try this, poke through and see what's happening:

d <- data.frame(y=1:5,n=5)
g0 <- glm(cbind(y,n-y)~1,data=d,family=binomial)
debug(stats:::simulate.lm)
s <- simulate(g0,n=5)

This is the weird, back-door solution. Create a list, change its class to data.frame, and then (this is required) set the names and row.names manually (if you don't do those final steps the data will still be in the object, but it will print out as though it had zero rows ...)

m1 <- matrix(1:10,ncol=2)
m2 <- matrix(5:14,ncol=2)
dd <- list(m1,m2)
class(dd) <- "data.frame"
names(dd) <- LETTERS[1:2]
row.names(dd) <- 1:5
dd
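
To check what the back-door construction produced, the object can be inspected as below. This is only a sketch that repeats the steps above; the checks at the end are an addition, not part of the original recipe.

```r
# Rebuild the object exactly as in the steps above.
m1 <- matrix(1:10, ncol = 2)
m2 <- matrix(5:14, ncol = 2)
dd <- list(m1, m2)
class(dd) <- "data.frame"
names(dd) <- LETTERS[1:2]
row.names(dd) <- 1:5

# Two "columns", each of which is still a 5 x 2 matrix.
dim(dd)          # 5 2
is.matrix(dd$A)  # TRUE
dim(dd$B)        # 5 2
```

Each element of the underlying list stays a matrix; the data.frame class only changes how the object is printed and indexed.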
Ben Bolker
5

I came across the same problem while trying to understand the gasoline data in the pls package, and used $ for the job. First, let's create a matrix called spectra_mat, then a vector called response_var1.

spectra_mat = matrix(1:45, 9, 5)
response_var1 = 1:9

Now we put the vector response_var1 in a new data frame, which we'll call df.

df = data.frame(response_var1)
df$spectra = spectra_mat

To check,

str(df)

'data.frame':   9 obs. of  2 variables:
 $ response_var1: int  1 2 3 4 5 6 7 8 9
 $ spectra      : int [1:9, 1:5] 1 2 3 4 5 6 7 8 9 10 ...
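
One reason this layout is convenient is that modelling functions can take the matrix column as a single term in a formula. The sketch below uses base lm() rather than pls, with made-up random data; the values and the choice of three predictor columns are assumptions for illustration only.

```r
set.seed(1)

# Hypothetical data: 9 observations, a 3-column predictor matrix.
spectra_mat <- matrix(rnorm(9 * 3), nrow = 9, ncol = 3)
df <- data.frame(response_var1 = rnorm(9))
df$spectra <- spectra_mat

# The whole matrix enters the formula as one term.
fit <- lm(response_var1 ~ spectra, data = df)
length(coef(fit))  # intercept plus one coefficient per matrix column
```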
zoc99
5

A much easier way to do this is to define the data frame with a placeholder for the matrix:

m <- matrix(c(1, 2, 3, 4), nrow = 2, ncol = 2) 
df <- data.frame(id = 1, mat = rep(0, nrow(m)))

Then assign the matrix. There is no need to play with the class of a list or to use an *apply() function.

df$mat <- m
adamleerich
  • Although this leaves you with the matrix being turned INTO a column in the dataframe. Might be ok for some application (and you can just access elements by i*nrow + ncol), but it is limiting if your matrices are of different sizes. – Three Diag Apr 06 '16 at 11:07
3

Data frames containing matrix columns do have their uses in specialized scenarios. These scenarios are cases when you have a whole vector of some variable for every observation in your data set. There are two cases that I have come across where this is common:

  1. Bayesian analysis: you create a posterior prediction for each observation, so for every "row" in your newdata, you have an entire vector (the length of that vector is the number of MCMC iterations).
  2. Functional data analysis: each "observation" is itself a function, and you store the observed realization of that function as a vector.

If you're working with data frames, there are a few obvious ways to handle this data, all of them inefficient. I'll use the Bayesian case as an example:

  1. "Super-wide" format: you have one column for each element of the vectors, in addition to your other columns of the data frame. This makes an extremely wide data frame that is often hard to work with. It also makes it difficult to refer to only those columns that correspond to the posterior.
  2. "Super-long" (tidy) format: very memory intensive because all of the other columns of your data frame have to be repeated unnecessarily for every iteration of the posterior.
  3. List-columns: you can create a list where each element is the vector corresponding to the posterior for that row of the data frame. The problem here is that most of the manipulation you want to do will require you to unlist the posterior back to a matrix, and the listing/unlisting is unnecessary computation.
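
The three layouts can be sketched as follows. The data are made up (3 observations with 4 posterior draws each), and every object name here is hypothetical.

```r
set.seed(42)

# Hypothetical posterior: one row per observation, one column per draw.
post <- matrix(rnorm(3 * 4), nrow = 3, ncol = 4)

# 1. Super-wide: one data frame column per draw.
wide <- data.frame(id = 1:3, post)

# 2. Super-long: the id column is repeated once per draw.
long <- data.frame(id    = rep(1:3, times = 4),
                   iter  = rep(1:4, each = 3),
                   value = as.vector(post))

# 3. List-column: one vector per row, re-bound to recover the matrix.
lst <- data.frame(id = 1:3)
lst$post <- lapply(1:3, function(i) post[i, ])
rebuilt <- do.call(rbind, lst$post)

dim(wide)  # 3 5
dim(long)  # 12 3
```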

Data frames with matrix columns are a very useful solution to this situation. The posterior stays in a matrix that has the same number of rows as the data frame, but that matrix is recognized as only a single "column" of the data frame, and referring to that column with df$mat returns the matrix. You can even use some dplyr functions, such as filtering, to return the corresponding rows of the matrix, but this is a bit experimental.
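
A small base R sketch of that behaviour, with made-up values: the matrix counts as one column, and subsetting rows of the data frame carries the matching rows of the matrix along. The names post and sub are hypothetical.

```r
# Hypothetical posterior: 3 observations, 4 draws each.
post <- matrix(1:12, nrow = 3)

df <- data.frame(id = 1:3)
df$post <- post

ncol(df)            # 2: the matrix counts as a single column
is.matrix(df$post)  # TRUE

# Row subsetting keeps the matching rows of the matrix.
sub <- df[df$id >= 2, ]
dim(sub$post)       # 2 4
```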

The easiest method to create the matrix column is in two steps. First create the data frame without the matrix column, then add the matrix column with a simple assignment. I haven't found a 1-step solution to do this that doesn't involve I() which changes the column type.

m <- matrix(c(1,2,3,4), nrow=2, ncol=2)
df <- data.frame(id = rep(1, nrow(m)))
df$mat <- m
names(df)
# [1] "id"  "mat"
str(df)
# 'data.frame': 2 obs. of  2 variables:
#  $ id : num  1 1
#  $ mat: num [1:2, 1:2] 1 2 3 4
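
If a single data.frame() call is preferred, one possible workaround is to use I() and then strip the AsIs marker afterwards. This is only a sketch of that idea, and the name one_step is hypothetical.

```r
m <- matrix(c(1, 2, 3, 4), nrow = 2, ncol = 2)

# One-step construction; the column carries the "AsIs" class.
one_step <- data.frame(id = rep(1, nrow(m)), mat = I(m))
had_asis <- inherits(one_step$mat, "AsIs")  # TRUE

# Drop the marker so the column is a plain matrix again.
one_step$mat <- unclass(one_step$mat)
inherits(one_step$mat, "AsIs")  # FALSE
is.matrix(one_step$mat)         # TRUE
```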
dww
Jonathan Gellar
1

The result you got (2 rows x 3 columns) is what is to be expected from R, as it amounts to cbind-ing a vector (id, with recycling) and a matrix (m).
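
A quick check of that behaviour, using the matrix from the question:

```r
m <- matrix(c(1, 2, 3, 4), nrow = 2, ncol = 2)
df <- data.frame(id = 1, mat = m)

# The matrix is split into one data frame column per matrix column,
# and id is recycled to match the two rows.
names(df)  # "id" "mat.1" "mat.2"
dim(df)    # 2 3
```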

IMO, it would be better to use a list or an array (when dimensions agree; no mixing of numeric and factor values is allowed) if you really want to bind different data structures. Otherwise, if both have the same number of rows, simply cbind-ing your matrix to an existing data.frame will do the job. For example:

x1 <- replicate(2, rnorm(10))
x2 <- replicate(2, rnorm(10))
x12l <- list(x1=x1, x2=x2)
x12a <- array(c(x1, x2), dim=c(10,2,2))  # x1 becomes slice [,,1], x2 becomes slice [,,2]

and the result reads

> str(x12l)
List of 2
 $ x1: num [1:10, 1:2] -0.326 0.552 -0.675 0.214 0.311 ...
 $ x2: num [1:10, 1:2] -0.164 0.709 -0.268 -1.464 0.744 ...
> str(x12a)
 num [1:10, 1:2, 1:2] -0.326 0.552 -0.675 0.214 0.311 ...

Lists are easier to use if you plan to work with matrices of varying dimensions, and provided their rows are organized in the same way as an external data.frame, you can subset them just as easily. Here is an example:

df1 <- data.frame(grp=gl(2, 5, labels=LETTERS[1:2]), 
                  age=sample(seq(25,35), 10, replace=TRUE))
with(df1, tapply(x12l$x1[,1], list(grp, age), mean))

You can also use lapply (for list) and apply (for array) functions.
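
As a rough illustration of both, the sketch below computes column means per matrix. It rebuilds the list and array itself, stacking the two matrices as slices with c(x1, x2), so that it runs on its own; the seed is arbitrary.

```r
set.seed(123)
x1 <- replicate(2, rnorm(10))
x2 <- replicate(2, rnorm(10))

x12l <- list(x1 = x1, x2 = x2)
x12a <- array(c(x1, x2), dim = c(10, 2, 2))  # slices [,,1] = x1, [,,2] = x2

# List: apply a function to each matrix.
means_l <- lapply(x12l, colMeans)

# Array: apply a function over the third dimension (one slice per matrix).
means_a <- apply(x12a, 3, colMeans)
```

Here means_l is a list of two vectors, while means_a is a 2 x 2 matrix with one column per slice.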

chl
0

To get a data.frame with 1 row and 2 columns for the given example you have to put the matrix inside a list.

m <- matrix(1:4, 2)

x <- list2DF(list(id=1, mat=list(m)))
x
#  id        mat
#1  1 1, 2, 3, 4

str(x)
#'data.frame':   1 obs. of  2 variables:
# $ id : num 1
# $ mat:List of 1
#  ..$ : int [1:2, 1:2] 1 2 3 4


y <- data.frame(id=1, mat=I(list(m)))
y
#  id        mat
#1  1 1, 2, 3, 4

str(y)
#'data.frame':   1 obs. of  2 variables:
# $ id : num 1
# $ mat:List of 1
#  ..$ : int [1:2, 1:2] 1 2 3 4
#  ..- attr(*, "class")= chr "AsIs"
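
The same list-column idea extends to several rows, including matrices of different sizes. The sketch below is an illustration with made-up matrices; the names mats and d are hypothetical.

```r
# Two hypothetical matrices with different dimensions.
mats <- list(matrix(1:4, 2), matrix(1:6, 3))

# One row per matrix: build the frame first, then add the list-column.
d <- data.frame(id = 1:2)
d$mat <- mats

nrow(d)           # 2
dim(d$mat[[1]])   # 2 2
dim(d$mat[[2]])   # 3 2
```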

To create a data.frame with a single column containing the matrix (here with 2 rows and 2 columns), wrapping it in I() directly in the data.frame() call is straightforward. An alternative without AsIs is to insert the matrix afterwards, as others have already shown.

m <- matrix(1:4, 2)

x <- data.frame(id=1, mat=I(m))
str(x)
#'data.frame':   2 obs. of  2 variables:
# $ id : num  1 1
# $ mat: 'AsIs' int [1:2, 1:2] 1 2 3 4

y <- data.frame(id=rep(1, nrow(m)))
y[["m"]] <- m
#y["m"] <- m   #Alternative
#y[,"m"] <- m  #Alternative
#y$m <- m      #Alternative
str(y)
#'data.frame':   2 obs. of  2 variables:
# $ id: num  1 1
# $ m : int [1:2, 1:2] 1 2 3 4

z <- `[<-`(data.frame(id=rep(1, nrow(m))), , "mat", m)
str(z)
#'data.frame':   2 obs. of  2 variables:
# $ id : num  1 1
# $ mat: int [1:2, 1:2] 1 2 3 4

Alternatively the data can be stored in a list.

m <- matrix(1:4, 2)
x <- list(id=1, mat=m)
x
#$id
#[1] 1
#
#$mat
#     [,1] [,2]
#[1,]    1    3
#[2,]    2    4

str(x)
#List of 2
# $ id : num 1
# $ mat: int [1:2, 1:2] 1 2 3 4
GKi