4

How can I partition a matrix or dataframe into N equally-sized chunks with R? I want to cut the matrix or dataframe horizontally.

For example, given:

r = 8
c = 10
number_of_chunks = 4
data = matrix(seq(r*c), nrow = r, ncol=c)
data

     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,]    1    9   17   25   33   41   49   57   65    73
[2,]    2   10   18   26   34   42   50   58   66    74
[3,]    3   11   19   27   35   43   51   59   67    75
[4,]    4   12   20   28   36   44   52   60   68    76
[5,]    5   13   21   29   37   45   53   61   69    77
[6,]    6   14   22   30   38   46   54   62   70    78
[7,]    7   15   23   31   39   47   55   63   71    79
[8,]    8   16   24   32   40   48   56   64   72    80

I would like to cut data into a list of 4 elements:

Element 1:

     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,]    1    9   17   25   33   41   49   57   65    73
[2,]    2   10   18   26   34   42   50   58   66    74

Element 2:

     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[3,]    3   11   19   27   35   43   51   59   67    75
[4,]    4   12   20   28   36   44   52   60   68    76

Element 3:

     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[5,]    5   13   21   29   37   45   53   61   69    77
[6,]    6   14   22   30   38   46   54   62   70    78

Element 4:

     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[7,]    7   15   23   31   39   47   55   63   71    79
[8,]    8   16   24   32   40   48   56   64   72    80

With NumPy in Python, I can use numpy.array_split.

Franck Dernoncourt

3 Answers

5

Here's an attempt in base R. Calculate "pretty" cut values for the sequence of row numbers using pretty. Categorize the row numbers with cut, then use split to return a list of the row numbers grouped by those cut values. Finally, run through the list of split row numbers with lapply and extract the matrix subsets with [.

lapply(split(seq_len(nrow(data)),
             cut(seq_len(nrow(data)), pretty(seq_len(nrow(data)), number_of_chunks))),
       function(x) data[x, , drop = FALSE])
$`(0,2]`
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,]    1    9   17   25   33   41   49   57   65    73
[2,]    2   10   18   26   34   42   50   58   66    74

$`(2,4]`
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,]    3   11   19   27   35   43   51   59   67    75
[2,]    4   12   20   28   36   44   52   60   68    76

$`(4,6]`
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,]    5   13   21   29   37   45   53   61   69    77
[2,]    6   14   22   30   38   46   54   62   70    78

$`(6,8]`
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,]    7   15   23   31   39   47   55   63   71    79
[2,]    8   16   24   32   40   48   56   64   72    80

Roll this into a function:

array_split <- function(data, number_of_chunks) {
  rowIdx <- seq_len(nrow(data))
  # drop = FALSE keeps a one-row chunk as a matrix instead of a vector
  lapply(split(rowIdx, cut(rowIdx, pretty(rowIdx, number_of_chunks))),
         function(x) data[x, , drop = FALSE])
}

Then, you can use

array_split(data=data, number_of_chunks=number_of_chunks)

to return the same result as above.


A nice simplification suggested by @user20650 is

split.data.frame(data,
                 cut(seq_len(nrow(data)), pretty(seq_len(nrow(data)), number_of_chunks)))

To my surprise, split.data.frame returns a list of matrices when its first argument is a matrix.
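One caveat: pretty chooses "round" breakpoints and only aims for approximately the requested number of intervals, so the number of chunks is not guaranteed. The following is a minimal sketch of one way to get exactly number_of_chunks pieces, with sizes that differ by at most one row. The name array_split_exact is made up for this illustration and is not part of base R.

```r
# Hypothetical helper (illustration only): split rows into exactly
# number_of_chunks groups whose sizes differ by at most one.
array_split_exact <- function(data, number_of_chunks) {
  n <- nrow(data)
  stopifnot(number_of_chunks >= 1, number_of_chunks <= n)
  # Chunk id for every row, running from 1 to number_of_chunks
  chunk_id <- ceiling(seq_len(n) * number_of_chunks / n)
  # drop = FALSE keeps one-row chunks as matrices (or data frames)
  lapply(split(seq_len(n), chunk_id), function(idx) data[idx, , drop = FALSE])
}

data <- matrix(seq(8 * 10), nrow = 8, ncol = 10)
chunks <- array_split_exact(data, 4)
sapply(chunks, nrow)  # number of rows in each chunk
```

Because the chunk ids are computed directly from the row numbers, the result does not depend on where pretty happens to place its breakpoints.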

lmo
  • How to use array_split for a matrix of 3510 rows that I want to split into 10 equal-sized submatrices? pretty argument splits my matrix into 8 chunks instead of 10 – siegfried Oct 27 '20 at 07:50
1
number_of_chunks = 4
chunk_size = ceiling(NROW(data) / number_of_chunks)  # rows per chunk; the last chunk may be shorter
lapply(seq(1, NROW(data), chunk_size),
       function(i) data[i:min(i + chunk_size - 1, NROW(data)), , drop = FALSE])

OR

# assumes NROW(data) is a multiple of number_of_chunks
lapply(split(data, rep(1:number_of_chunks, each = NROW(data)/number_of_chunks)),
       function(a) matrix(a, ncol = NCOL(data)))
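
The question also mentions data frames. For a data frame, split already works row-wise when given one group label per row, so no reshaping with matrix() is needed. A minimal sketch, assuming the row count is a multiple of number_of_chunks; df, groups and df_chunks are illustrative names only.

```r
number_of_chunks <- 4
# Same values as the example matrix, stored as a data frame (columns V1..V10)
df <- as.data.frame(matrix(seq(8 * 10), nrow = 8, ncol = 10))

# One group label per row: 1 1 2 2 3 3 4 4
groups <- rep(seq_len(number_of_chunks), each = nrow(df) / number_of_chunks)

# split() on a data frame partitions its rows by the grouping vector
df_chunks <- split(df, groups)
length(df_chunks)
df_chunks[["2"]]  # rows 3 and 4 of df
```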
d.b
1

Try not to split the data explicitly, because that creates another copy. Instead, split the indices you want to access.

With this function, you can split by number of chunks (for parallelism) or by size of the chunks.

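CutBySize <- function(m, block.size, nb = ceiling(m / block.size)) {
  int <- m / nb                      # average (possibly fractional) chunk size
  upper <- round(1:nb * int)         # last index of each chunk
  lower <- c(1, upper[-nb] + 1)      # first index of each chunk
  size <- c(upper[1], diff(upper))   # number of indices in each chunk
  cbind(lower, upper, size)
}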

CutBySize(nrow(data), nb = number_of_chunks)

     lower upper size
[1,]     1     2    2
[2,]     3     4    2
[3,]     5     6    2
[4,]     7     8    2
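
The index table can then drive the actual work one chunk at a time, so that only one chunk is extracted at any moment. The following is a rough sketch of that idea; intervals and chunk_col_sums are illustrative names, and colSums merely stands in for whatever per-chunk computation is needed.

```r
CutBySize <- function(m, block.size, nb = ceiling(m / block.size)) {
  int <- m / nb
  upper <- round(1:nb * int)
  lower <- c(1, upper[-nb] + 1)
  size <- c(upper[1], diff(upper))
  cbind(lower, upper, size)
}

data <- matrix(seq(8 * 10), nrow = 8, ncol = 10)
intervals <- CutBySize(nrow(data), nb = 4)

# Visit each chunk through its row range instead of building a list of copies
chunk_col_sums <- lapply(seq_len(nrow(intervals)), function(k) {
  rows <- intervals[k, "lower"]:intervals[k, "upper"]
  colSums(data[rows, , drop = FALSE])
})
```

Each call still extracts its own rows, but the full list of sub-matrices is never held in memory at once.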
F. Privé