
I have a data frame of 400,000 rows and about 50 columns. Because it is so large, it is too computationally taxing to work with in one piece. I would like to split it up into smaller data frames, run the functions I need on each piece, and then reassemble the data frame at the end.

There is no grouping variable I want to split on; I would just like to split by number of rows. For example, I would like to split this 400,000-row table into 400 data frames of 1,000 rows each. How might I do this?

– Pascal

2 Answers


Make your own grouping variable.

d <- split(my_data_frame, rep(1:400, each = 1000))
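
For the full round trip the question describes (split, compute on each piece, reassemble), a minimal base-R sketch might look like this; `my_fun` below is a hypothetical placeholder for whatever per-chunk function you want to run:

results <- lapply(d, my_fun)            ## apply the (hypothetical) function to each piece
recombined <- do.call(rbind, results)   ## reassemble into one data frame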

You should also consider the ddply function from the plyr package, or the group_by() function from dplyr.
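
As a rough sketch of the plyr route (again with `my_fun` as a hypothetical per-chunk function that returns a data frame):

library(plyr)
my_data_frame$index <- rep(1:400, each = 1000)   ## grouping variable as a column
result <- ddply(my_data_frame, "index", my_fun)  ## apply per group, rbind the results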

edited for brevity, after Hadley's comments.

If you don't know how many rows are in the data frame, or if its row count might not be an exact multiple of your desired chunk size, you can do

chunk <- 1000                                    ## desired rows per piece
n <- nrow(my_data_frame)
r <- rep(1:ceiling(n/chunk), each = chunk)[1:n]  ## truncated, so the last piece may be shorter
d <- split(my_data_frame, r)
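
A quick sanity check (not in the original answer) to confirm the chunking: every piece should have chunk rows except possibly the last one.

sapply(d, nrow)  ## e.g. 1000 for every element, with a shorter final piece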

You could also use

r <- ggplot2::cut_width(1:n, chunk, boundary = 0)
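
cut_width() returns a factor of interval labels (something like "[0,1e+03]", "(1e+03,2e+03]"), so it plugs into split() the same way:

d <- split(my_data_frame, r)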

For future readers, methods based on the dplyr and data.table packages will probably be (much) faster for doing group-wise operations on data frames, e.g. something like

(my_data_frame
   %>% mutate(index = rep(1:ngrps, each = full_number, length.out = n()))
   %>% group_by(index)
   %>% summarise(...)   ## placeholder: use mutate(), summarise(), or do() here
)
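
Since a comment below asks about the data.table option, here is a hedged sketch along the same lines (`chunk` rows per group as above; `my_fun` is still a hypothetical per-group function):

library(data.table)
DT <- as.data.table(my_data_frame)
DT[, index := ceiling(seq_len(.N) / chunk)]  ## .N is the total row count
result <- DT[, my_fun(.SD), by = index]      ## .SD is the sub-table for each group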

There are also many answers here.

– Ben Bolker
  • thanks! that works great! And yes, I'll look into the plyr package as it seems very useful. – Pascal Aug 15 '11 at 00:03
  • Why are you using `split.data.frame` and not `split`? And you don't need to coerce the grouping variable to a factor. – hadley Aug 15 '11 at 07:33
  • wasn't sure whether those coercions/method dispatches would work, and was too lazy to take the time to test. thanks. – Ben Bolker Aug 15 '11 at 11:39
  • This only seems to work if you know how many data frames you want to output. How would you use it if you don't know how many data frames will be produced, especially for data frames of variable length that you want split into smaller data frames with no more than a certain number of rows? – user5359531 Mar 16 '16 at 17:06
  • `rep` also takes the length.out argument, so you could write `split(my_data_frame, rep(1:ceiling(nrow(df)/chunk), each=chunk, length.out=nrow(df)))` as an alternative in the more complicated case. – lmo Mar 29 '17 at 14:40
  • How would you modify this if I only know the number of chunks, and the size of the data frame doesn't divide evenly by the number of chunks, so the last chunk has a different size? – user1700890 Feb 19 '18 at 21:39
  • how about `index <- rep(1:ngrps, each=full_number)[1:nrow(my_data_frame)]`? – Ben Bolker Feb 20 '18 at 00:46
  • What are the data.table and dplyr options? – Omar Gonzales Aug 02 '20 at 17:03

I had a similar question and used this:

library(tidyverse)
n <- 100  # rows per group (not number of groups)
split <- df %>% group_by(row_number() %/% n) %>% group_map(~ .x)

from left to right:

  • you assign your result to split
  • you start with df as your input data frame
  • then you group your data by the integer division (`%/%`) of the row number by n, so every n consecutive rows get the same group id
  • then you pass each group through group_map(), which returns a list with one element per group.

So in the end, split is a list with one chunk of your dataset in each element. Alternatively, you could write your data to disk immediately by replacing the group_map call with e.g. group_walk(~ write_csv(.x, paste0("file_", .y, ".csv"))).
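
If you eventually want the single large data frame back, as the question asks, one plausible sketch (again with a hypothetical per-chunk function `my_fun`) is:

processed <- lapply(split, my_fun)          ## 'split' is the list built above
recombined <- dplyr::bind_rows(processed)   ## stack the pieces back together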

You can find more info on these tools in the dplyr cheat sheet (which covers group_by) and in the documentation for the group_map and group_walk follow-up functions.

– Jurgen Van Impe
  • 4
    You can shorten this and do `df %>% group_split(group_id = row_number() %/% n)`. Or in base, `split(df, seq(nrow(df)) %/% n)` –  Aug 27 '21 at 00:45