0

I would like to perform some computation on a large dataframe. To do so I need to

  1. split my dataframe into n equal parts (the last chunk will have its own size)
  2. Do my computation (I add the result in a new column)
  3. Recombine the dataframe

How can I do that?

Many thanks in advance for your help!

 dataframe <- MyDataFrame
 nb_obs <- nrow(dataframe) # in my dataframe I have 153 036 rows
 nb_chunk <- ceiling(dataframe / 250) # I thus need 613 chunks if I want 250 obs per sub-dataframe
 for(i in 1:nb_chunk) {
         # my computation here, I want to add a new columns to the chunk to store my results..
 }  

 # then I want to recombine the final dataset (equals to original dataset with a new column added)

EDITED PART

Thank you for your comments, please find below my proposal in a reproducible example using iris dataset.

I have 2 additional questions at this stage:

  1. Can I initiate 'final.df' without hard coding the column names?
  2. Is there a better way to proceed with dplyr?

    df <- iris                                        # using iris as an example (my real dataframe is 153036 rows and 17 columns)
    nb_obs <- nrow(df)                                # nb of observations in the dataframe (thus nb of operations to be performed)
    nb_obs_in_chunk <- 13                             # nb of rows per chunk
    nb_chunk <- ceiling(nb_obs / nb_obs_in_chunk)     # total nb of chunks to be created
    nb_chunk_full <- floor(nb_obs / nb_obs_in_chunk)  # nb of chunks to be created with nb_obs_in_chunk rows
    nb_obs_last_chunk <- nb_obs - nb_obs_in_chunk*nb_chunk_full # nb of rows in final chunks
    df$split.factor <- as.factor(c(rep(1:nb_chunk_full, each = nb_obs_in_chunk), rep(nb_chunk_full + 1, nb_obs_last_chunk))) # create factor to split dataframe into equal parts
    final.df <- data.frame(Sepal.Length = numeric(), Sepal.Width = numeric(), Petal.Length = numeric(), Petal.Width = numeric(), Species = factor(), split.factor = factor()) # initiate final dataframe (desired output)
    for(i in 1:nb_chunk) {
      temp_i <- df[df$split.factor == i, ]
      temp_i$NEW <- temp_i$Sepal.Length + temp_i$Sepal.Width
      final.df <- rbind(final.df, temp_i)
    }
    
cho7tom
  • 1,030
  • 2
  • 13
  • 30
  • 3
    Try `indx <- as.numeric(gl(nb_obs, 250, nb_obs)); lst <- split(dataframe, indx)` to get the datasets in a list, Then use `lapply` to loop over the `lst` and do the computation, create a new column and `unsplit`. It would be better if you provide a small reproducible example and a better description of what you want to do. – akrun Apr 08 '15 at 08:29
  • 1
    I agree with akrun... the type of computation might determine a better course of actions. Are you solely summing up columns? You might not even need to split your dataframe. `sqldf` package, for instance, could help you achieve whay you're after. – Dominic Comtois Apr 08 '15 at 08:38
  • 1
    There are some great hints how to make a good example http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – rmuc8 Apr 08 '15 at 08:46
  • I think adding a grouping column with `indx`, converting the dataset to `data.table`, assign a new column after the computation would be faster. But, it depends on showing a reproducible example. – akrun Apr 08 '15 at 08:47
  • What you want to do seems to be the classic split-apply-combine kind of operation for which there are also some nice packages like dplyr and data.table and others – talat Apr 08 '15 at 08:56
  • In any case, if you're convinced you need to split, you could add this factor to your dataframe: `dataframe$split.factor <- as.factor(c(rep(1:611, 250), rep(612, 286)))` (611 groups of 250 plus 1 of 286). And then `dataframes <- split(x = dataframe, f=dataframe$split.factor)` – Dominic Comtois Apr 08 '15 at 08:57
  • Thank you all for your comments, I editied my question to make it reproducible and I propose a solution (probably not optimized as I did not use dplyr..) – cho7tom Apr 08 '15 at 12:12

2 Answers2

4

To answer your initial questions:

df <- iris
do.call(rbind,
        lapply(split(df, rep(seq(13), length.out = 150, each = 13)),
               function(chunk) {
                 chunk$NEW = chunk$Sepal.Length + chunk$Sepal.Width
                 chunk
               })
       )

rep(seq(13), length.out = 150, each = 13) gives you the same as the split.factor column in your example and the split() function turns this into a factor.

This also answers the first of the questions in your edit: yes, you don't need to initialize the final data frame.

Note that the row names are no longer a sequence from 1 to 150 as they now contain the chunk number.

mjkallen
  • 468
  • 3
  • 12
1

I think this does it:

library("dplyr")
ddf <- iris                                        
nb_obs_in_chunk <- 13
ddf %>% mutate(id=seq(nrow(ddf)),
               chunk=cut(id,id %/% nb_obs_in_chunk)) %>%
    group_by(chunk) %>%
        mutate(NEW=Sepal.Length+Sepal.Width)

but haven't tested yet ... @mjkallen's rep(seq(13), length.out = 150, each = 13)) might be better than my %/%-based solution to getting the chunk numbers.

Ben Bolker
  • 211,554
  • 25
  • 370
  • 453