Creating Binary Matrix in R with a large dataset

Question

I am working with R version 3.2.3 on a linux machine.

I have a dataset with 1,374,439 observations of 145 variables. I need to transform this dataframe into a binary matrix.

I have looked at different forums and I found a solution with the package reshape2 and the functions melt() and dcast(). This works perfectly with a small dataset (I always first tried my codes on a small parts in order to check if it is doing what I want). When I want to use this code on the whole dataset, it does not work anymore.

I have looked at others forums and I have tried (with no success), the following functions:

table()
sparseMatrix() and as.Matrix()
xtabs()

I have also found the use of the packages dplyr and tidyr for bigger datasets. But I did not succeed. Honestly, I struggled with the understanding too. But it seems that the size of my dataset is the main problem...

The data looks like that (this is a version short version):

    Code_1  Code_2  Code_3  Code_4  Code_5  Code_6  Code_7
1   M201            M2187   M670
2   O682            O097    Z370             O48    O759
3   S7211   Z966            Z501

And I would like to have this (a binary matrix):

    M201    M2187   M670   O682   O097  Z370    O48   0759    S7211  Z966    Z501
1   1          1       1      0      0     0      0      0        0     0       0
2   0          0       0      1      1     1      1      1        0     0       0
3   0          0       0      0      0     0      0      0        1     1       1

I would also like to be precise, that the blank spaces are not NAs. It is really blank spaces.

maybe of interest http://stackoverflow.com/questions/23035982/directly-creating-dummy-variable-set-in-a-sparse-matrix-in-r?answertab=votes#tab-top — user20650, Jul 11 '16 at 19:13

akrun · Answer 1 · 2016-07-11T14:34:29.377

We can apply table after melting the data into long formatand remove the blank (''`) elements.

library(reshape2)
table(droplevels(subset(melt(as.matrix(df1)), value!='', select = -2)))
#    value
# Var1 M201 M2187 M670 O097 O48 O682 O759 S7211 Z370 Z501 Z966
#    1    1     1    1    0   0    0    0     0    0    0    0
#    2    0     0    0    1   1    1    1     0    1    0    0
#    3    0     0    0    0   0    0    0     1    0    1    1

Or using dplyr/tidyr

library(dplyr)
library(tidyr)
data_frame(rn = rep(1:nrow(df1), ncol(df1)), v1 = unlist(df1)) %>% 
        filter(v1!="") %>%
        group_by(rn, v1) %>% 
        summarise(n = n()) %>%
        spread(v1, n, fill = 0)
#    rn  M201 M2187  M670  O097   O48  O682  O759 S7211  Z370  Z501  Z966
#   <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1     1     1     1     1     0     0     0     0     0     0     0     0
#2     2     0     0     0     1     1     1     1     0     1     0     0
#3     3     0     0     0     0     0     0     0     1     0     1     1

Or using dcast from data.table

library(data.table)
dcast(data.table(rn = rep(1:nrow(df1), ncol(df1)), 
                   v1 = unlist(df1))[v1!=''], rn~v1, length)
#    rn M201 M2187 M670 O097 O48 O682 O759 S7211 Z370 Z501 Z966
#1:  1    1     1    1    0   0    0    0     0    0    0    0
#2:  2    0     0    0    1   1    1    1     0    1    0    0
#3:  3    0     0    0    0   0    0    0     1    0    1    1

score 2 · Accepted Answer · answered Jul 11 '16 at 15:13

2

What you actually need - is Matrix::sparse.model.matrix() function. Answers below create dense matrices, which will quickly eat all your ram on this dataset.

Here is simple example:

M = sparse.model.matrix( ~ ., data=data.frame(x = letters , y = LETTERS))

If you do not need intercept, use following formula

M = sparse.model.matrix( ~ -1 + ., data=data.frame(x = letters , y = LETTERS))

answered Jul 11 '16 at 15:13

Dmitriy Selivanov

4,545
1
22
38

Thank you for your answer. May I ask what represent "x" and "y" values? And if I understand you right, it will allow me to use it on big dataset? Or I completely misunderstand it? Thank you again for your answer. – Robin.N. Jul 21 '16 at 13:26
they are just some abstract columns in my example. You should pass your data frame to `sparse.model.matrix`: `M = sparse.model.matrix( ~ ., data=your_data_frame)` – Dmitriy Selivanov Jul 21 '16 at 19:38
I use the your answer. I would like to have the result into a dataframe format at the end (like I show in my question). I have looked at several options (such as summary, as.data.frame, ...). But the result is not what I want. So I am not sure now if I use your function correctly (since the output is not self-explaining to me) or I transform the result in a wrong way. – Robin.N. Aug 18 '16 at 13:10
You wrote "And I would like to have this (a binary matrix)". You got sparse matrix. It makes almost no sense to convert it to dense data.frame - you probably will go out of memory. If for some reason you still want to obtain data.frame - use `as.data.frame(as.matrix(M))` – Dmitriy Selivanov Aug 18 '16 at 14:49

bgoldst · Answer 3 · 2016-07-11T15:58:27.370

Here's a base R approach:

## define input data.frame
data <- data.frame(Code_1=c('M201','O682','S7211'),Code_2=c('','','Z966'),Code_3=c('M2187',
'O097',''),Code_4=c('M670','Z370','Z501'),Code_5=c('','',''),Code_6=c('','O48',''),Code_7=c(
'','O759',''),stringsAsFactors=F);

## coerce to a matrix to speed up subsequent operations
data <- as.matrix(data);

## solution
im <- which(arr.ind=T,data!='');
u <- unique(data[im[order(im[,'row'],im[,'col']),]]);
res <- matrix(0L,nrow(data),length(u),dimnames=list(NULL,u));
res[cbind(im[,'row'],match(data[im],u))] <- 1L;
res;
##      M201 M2187 M670 O682 O097 Z370 O48 O759 S7211 Z966 Z501
## [1,]    1     1    1    0    0    0   0    0     0    0    0
## [2,]    0     0    0    1    1    1   1    1     0    0    0
## [3,]    0     0    0    0    0    0   0    0     1    1    1

Benchmarking

library(microbenchmark);
library(reshape2);
library(dplyr);
library(tidyr);
library(data.table);

akrun1 <- function(df1) table(droplevels(subset(melt(as.matrix(df1)),value!='',select=-2)));
akrun2 <- function(df1) data_frame(rn=rep(1:nrow(df1),ncol(df1)),v1=unlist(df1)) %>% filter(v1!="") %>% group_by(rn,v1) %>% summarise(n=n()) %>% spread(v1,n,fill=0) %>% ungroup() %>% select(-rn);
akrun3 <- function(df1) dcast(data.table(rn=rep(1:nrow(df1),ncol(df1)),v1=unlist(df1))[v1!=''],rn~v1,length,value.var='v1')[,!'rn',with=FALSE];
bgoldst <- function(data) { data <- as.matrix(data); im <- which(arr.ind=T,data!=''); u <- unique(data[im[order(im[,'row'],im[,'col']),]]); res <- matrix(0L,nrow(data),length(u),dimnames=list(NULL,u)); res[cbind(im[,'row'],match(data[im],u))] <- 1L; res; };

harmonize <- function(res) {
    res <- as.matrix(if ('table'%in%class(res)) unclass(res) else res);
    res <- res[,order(colnames(res))];
    res <- res[do.call(order,as.data.frame(res)),];
    res;
}; ## end harmonize()

## OP's example
data <- data.frame(Code_1=c('M201','O682','S7211'),Code_2=c('','','Z966'),Code_3=c('M2187','O097',''),Code_4=c('M670','Z370','Z501'),Code_5=c('','',''),Code_6=c('','O48',''),Code_7=c('','O759',''),stringsAsFactors=F);

ex <- harmonize(akrun1(data));
all.equal(ex,harmonize(akrun2(data)),check.attributes=F);
## [1] TRUE
all.equal(ex,harmonize(akrun3(data)),check.attributes=F);
## [1] TRUE
all.equal(ex,harmonize(bgoldst(data)),check.attributes=F);
## [1] TRUE

microbenchmark(akrun1(data),akrun2(data),akrun3(data),bgoldst(data));
## Unit: microseconds
##           expr      min        lq      mean   median       uq       max neval
##   akrun1(data) 1155.945 1287.2345 1356.0013 1356.301 1396.072  1745.678   100
##   akrun2(data) 4053.292 4313.7315 4639.1197 4544.664 4763.408  6839.875   100
##   akrun3(data) 5866.965 6115.4320 6542.8618 6353.848 6601.886 11951.178   100
##  bgoldst(data)  108.197  144.1195  162.6198  162.936  180.684   240.769   100

## scale test
set.seed(1L);
NR <- 1374439L; NC <- 145L; NU <- as.integer(11/7*NC); probBlank <- 10/21;
repeat { u <- paste0(sample(LETTERS,NU,T),sprintf('%03d',sample(0:999,NU,T))); if (length(u)==NU) break; };
data <- setNames(nm=paste0('Code_',seq_len(NC)),as.data.frame(matrix(sample(c('',u),NR*NC,T,c(probBlank,rep((1-probBlank)/NU,NU))),NR)));

ex <- harmonize(akrun1(data));
all.equal(ex,harmonize(akrun2(data)),check.attributes=F);
## Error: cannot allocate vector of size 1.5 Gb
all.equal(ex,harmonize(akrun3(data)),check.attributes=F);
## Error: cannot allocate vector of size 1.5 Gb
all.equal(ex,harmonize(bgoldst(data)),check.attributes=F);
## [1] "Mean relative difference: 1.70387"

microbenchmark(akrun1(data),bgoldst(data),times=1L);
## Unit: seconds
##           expr       min        lq      mean    median        uq       max neval
##   akrun1(data) 101.81215 101.81215 101.81215 101.81215 101.81215 101.81215     1
##  bgoldst(data)  30.82899  30.82899  30.82899  30.82899  30.82899  30.82899     1

I don't know exactly why my result is not identical to akrun1(), but it appears his result is incorrect, since he has non-binary values in it:

unique(c(ex));
## [1] 0 1 2 3 4 5 6 7 8

Having `im` and `u`, the sparse alternative is `sparseMatrix(im[, "row"], match(data[im], u), x = 1L, dimnames = list(NULL, u))` in case some memory can be saved — alexis_laz, Jul 11 '16 at 15:16

Creating Binary Matrix in R with a large dataset

3 Answers3

Benchmarking