Questions tagged [split-apply-combine]

Split-apply-combine operations are a common type of data manipulation in which a function or statistic is computed independently on several chunks of data. The chunks are defined by the values of one or more variables. As the name implies, such an operation has three parts:

  1. Splitting data by the value of one or more variables
  2. Applying a function to each chunk of data independently
  3. Combining the data back into one piece
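
The three parts above can be sketched in plain Python using only the standard library. This is a minimal illustration, not a reference implementation: the helper name `split_apply_combine` and the sample student scores are hypothetical and made up for this example.

```python
from collections import defaultdict


def split_apply_combine(rows, key, func):
    """Group rows by key(row), apply func to each group, return {group: result}."""
    # 1. Split: collect rows into chunks keyed by the grouping value.
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)
    # 2. Apply: compute the function on each chunk independently.
    # 3. Combine: gather the per-chunk results into one dictionary.
    return {group: func(chunk) for group, chunk in groups.items()}


# Hypothetical student scores, made up for illustration.
scores = [
    {"class": "A", "student": "s1", "score": 71},
    {"class": "A", "student": "s2", "score": 88},
    {"class": "B", "student": "s3", "score": 64},
    {"class": "B", "student": "s4", "score": 59},
]

# Highest score per class.
highest = split_apply_combine(
    scores,
    key=lambda row: row["class"],
    func=lambda chunk: max(row["score"] for row in chunk),
)
print(highest)
```

Passing a tuple-valued `key` (for example `lambda row: (row["class"], row["student"])`) would split by more than one variable.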

Examples of split-apply-combine operations would be:

  • Computing median income by country from individual-level data (possibly appending the result to the same data)
  • Finding the highest score for each class from student scores

Tools for streamlining split-apply-combine operations are available for popular statistical computation environments (non-exhaustive list):

  • In the R statistical environment, there are dedicated packages for this purpose:
    • data.table is an extension of data.frame that is optimized for split-apply-combine operations, among other things
    • dplyr and its predecessor plyr provide convenient syntax and fast processing for such manipulations
  • In Python, the pandas library introduces data objects that include a group-by method for this type of operation
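
As a short sketch of the pandas group-by approach, the following computes a median income per country and also appends that result to the same data. The column names and values are hypothetical, made up for illustration. `DataFrame.groupby`, `.median()` and `.transform()` are existing pandas methods.

```python
import pandas as pd

# Hypothetical individual-level data, made up for illustration.
df = pd.DataFrame({
    "country": ["X", "X", "X", "Y", "Y"],
    "income": [10, 20, 60, 30, 50],
})

# Aggregate: split by country, apply the median, combine into one value per country.
median_by_country = df.groupby("country")["income"].median()

# Transform: same split and apply, but the result is broadcast back
# onto the original rows so it can be stored as a new column.
df["country_median"] = df.groupby("country")["income"].transform("median")

print(median_by_country)
print(df)
```

The aggregate form returns one row per group, while the transform form keeps the original row count, which is what "appending the result to the same data" requires.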

151 questions
48
votes
2 answers

python pandas, DF.groupby().agg(), column reference in agg()

On a concrete problem, say I have a DataFrame DF word tag count 0 a S 30 1 the S 20 2 a T 60 3 an T 5 4 the T 10 I want to find, for every "word", the "tag" that has the most "count". So the…
jf328
  • 6,841
  • 10
  • 58
  • 82
23
votes
2 answers

ddply + summarize for repeating same statistical function across large number of columns

Ok, second R question in quick succession. My data: Timestamp St_01 St_02 ... 1 2008-02-08 00:00:00 26.020 25.840 ... 2 2008-02-08 00:10:00 25.985 25.790 ... 3 2008-02-08 00:20:00 25.930 25.765 ... 4 2008-02-08 00:30:00 25.925…
Reuben L.
  • 2,806
  • 2
  • 29
  • 45
13
votes
1 answer

pandas: get all groupby values in an array

I'm sure this has been asked before, sorry if duplicate. Suppose I have the following dataframe: df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'], 'data': range(6)}, columns=['key', 'data']) >> key data 0 A 0 1 …
ru111
  • 813
  • 3
  • 13
  • 27
10
votes
2 answers

How to use dplyr to calculate a weighted mean of two grouped variables

I know this must be super easy, but I'm having trouble finding the right dplyr commands to do this. Let's say I want to group a dataset by two variables, and then summarize the count for each row. For this we simply have: mtcars %>% group_by(cyl,…
ds_guy
  • 143
  • 2
  • 5
7
votes
4 answers

R: split-apply-combine for geographic distance

I have downloaded a list of all the towns and cities etc in the US from the census bureau. Here is a random sample: dput(somewhere) structure(list(state = structure(c(30L, 31L, 5L, 31L, 24L, 36L, 13L, 21L, 6L, 10L, 31L, 28L, 10L, 5L, 5L, 8L, 23L,…
jvalenti
  • 604
  • 1
  • 9
  • 31
7
votes
1 answer

Using groupby with expanding and a custom function

I have a dataframe that consists of truthIds and trackIds: truthId = ['A', 'A', 'B', 'B', 'C', 'C', 'A', 'C', 'B', 'A', 'A', 'C', 'C'] trackId = [1, 1, 2, 2, 3, 4, 5, 3, 2, 1, 5, 4, 6] df1 = pd.DataFrame({'truthId': truthId, 'trackId': trackId}) …
Tara S
  • 143
  • 1
  • 5
7
votes
2 answers

Find half of each group with Pandas GroupBy

I need to select half of a dataframe using the groupby, where the size of each group is unknown and may vary across groups. For example: index summary participant_id 0 130599 17.0 13 1 130601 18.0 …
Arnold Klein
  • 2,956
  • 10
  • 31
  • 60
7
votes
2 answers

Programmatically calling group_by() on a varying variable

Using dplyr, I'd like to summarize [sic] by a variable that I can vary (e.g. in a loop or apply-style command). Typing the names in directly works fine: library(dplyr) ChickWeight %>% group_by( Chick, Diet ) %>% summarise( mw = mean( weight ) ) But…
Ari B. Friedman
  • 71,271
  • 35
  • 175
  • 235
6
votes
1 answer

Adding rows in `dplyr` output

In traditional plyr, returned rows are added automagically to the output even if they exceed the number of input rows for that grouping: set.seed(1) dat <- data.frame(x=runif(10),g=rep(letters[1:5],each=2)) > ddply( dat, .(g), function(df)…
Ari B. Friedman
  • 71,271
  • 35
  • 175
  • 235
5
votes
2 answers

Select rows of a DataFrame containing minimum of grouping variable in Julia

I'm wondering if there is an efficient way to do the following in Julia: I have a DataFrame of the following form: julia> df1 = DataFrame(var1=["a","a","a","b","b","b","c","c","c"], var2=["p","q","r","p","p","r","q","p","p"], …
5
votes
1 answer

normalizing data by duplication

note: this question is indeed a duplicate of Split pandas dataframe string entry to separate rows, but the answer provided here is more generic and informative, so with all respect due, I chose not to delete the thread I have a 'dataset' with the…
Lorinc Nyitrai
  • 968
  • 1
  • 10
  • 27
5
votes
3 answers

Pandas multiindex dataframe set first row in a column to 0

I am having some trouble working on grouped objects in pandas. Specifically, I want to be able to set the first row in a column to 0 while keeping other rows unchanged. For example: df = pd.DataFrame({'A': ['foo', 'bar', 'baz'] * 2, …
Ganesh Sundar
  • 311
  • 3
  • 11
4
votes
3 answers

how to add key variables to `dplyr::group_map()`?

I have the following, but want to add the group_by() key Species to the resulting tibble: MWE iris %>% group_by(Species) %>% group_map(~ broom::tidy(lm(Sepal.Length ~ Sepal.Width, data = .x))) %>% bind_rows() Output # How do I add the…
mkk
  • 879
  • 6
  • 19
4
votes
1 answer

purrr split %>% map %>% bind VERSUS dplyr group_by %>% do

I am often in the position of wanting to split-apply-combine regression models. I've found two ways of doing it, the "purrr" approach and the "dplyr::do()" approach. Issue with the purrr approach: I want columns in the resulting data.frame to…
Alex Coppock
  • 2,122
  • 3
  • 15
  • 31
4
votes
2 answers

MATLAB: return both arguments from ISMEMBER when used inside SPLITAPPLY

How can I access both arguments of ismember when it is used inside splitapply? splitapply only returns scalar values for each group, so in order to compute nonscalar values for each group (as returned by the first argument of ismember), one has to…
Confounded
  • 446
  • 6
  • 19