Split-apply-combine operations are a common type of data manipulation in which a function or statistic is computed independently on several chunks of data. The chunks are defined by the values of one or more grouping variables. As the name implies, these operations are composed of three parts:
- Splitting data by the value of one or more variables
- Applying a function to each chunk of data independently
- Combining the data back into one piece
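The three steps above can be sketched in plain Python without any dedicated library. This is only a minimal illustration: the records, the country labels, and the variable names are made up for the example.

```python
from collections import defaultdict
from statistics import median

# Hypothetical individual-level records: (country, income)
records = [("A", 10), ("B", 40), ("A", 30), ("B", 60), ("A", 20)]

# Split: collect the incomes into one chunk per country
chunks = defaultdict(list)
for country, income in records:
    chunks[country].append(income)

# Apply: compute the statistic on each chunk independently
medians = {country: median(incomes) for country, incomes in chunks.items()}

# Combine: attach each chunk's result back onto the original rows
combined = [(country, income, medians[country]) for country, income in records]
```

Here the combine step appends the group-level median to every original row; returning just the `medians` dictionary (one value per group) is an equally valid way to combine.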
Examples of split-apply-combine operations would be:
- Computing median income by country from individual-level data (possibly appending the result to the same data)
- Finding the highest score in each class from student scores
Tools for streamlining split-apply-combine operations are available for popular statistical computation environments (non-exhaustive list):
In the R statistical environment there are dedicated packages for this purpose:
- data.table is an extension of data.frame that is optimized for split-apply-combine operations, among other things
- dplyr and the original package plyr provide convenient syntax and fast processing for such manipulations
In Python, the pandas library provides data objects (DataFrame and Series) with a groupby method for this type of operation.
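As a rough sketch of how the two examples mentioned earlier might look with pandas: `groupby` splits the rows, the aggregation (`median`, `max`) is applied per group, and pandas combines the results. The data frames, column names, and values below are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical individual-level income data
people = pd.DataFrame({
    "country": ["A", "B", "A", "B", "A"],
    "income": [10, 40, 30, 60, 20],
})

# One median per country (result is indexed by country)
median_by_country = people.groupby("country")["income"].median()

# Append the group median to the original rows instead of collapsing them
people["country_median"] = people.groupby("country")["income"].transform("median")

# Hypothetical student scores: highest score in each class
scores = pd.DataFrame({
    "class_id": ["x", "x", "y", "y"],
    "score": [71, 88, 93, 65],
})
top_scores = scores.groupby("class_id")["score"].max()
```

Aggregating returns one row per group, while `transform` returns a result aligned with the original rows, which is what makes it suitable for appending a group statistic to the same data.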