Questions tagged [split-apply-combine]

Split-apply-combine operations refer to a common type of data manipulation in which a function or statistic is computed independently on several chunks of data. The chunks are defined by the values of one or more variables. The operation has three steps:

Splitting data by the value of one or more variables
Applying a function to each chunk of data independently
Combining the data back into one piece

Examples of split-apply-combine operations include:

Computing median income by country from individual-level data (possibly appending the result to the same data)
Finding the highest score in each class from student scores
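
The split, apply, and combine steps described above can be sketched in plain Python for the median-income example. This is only an illustration using the standard library; the `records` data, the country codes, and all variable names are made up for the sketch and do not come from any question under this tag.

```python
from collections import defaultdict
from statistics import median

# Hypothetical individual-level records: (country, income)
records = [
    ("FR", 30000),
    ("US", 52000),
    ("FR", 41000),
    ("US", 48000),
    ("FR", 35000),
]

# Split: collect incomes into chunks keyed by the grouping variable
groups = defaultdict(list)
for country, income in records:
    groups[country].append(income)

# Apply: compute the statistic on each chunk independently
median_by_country = {
    country: median(incomes) for country, incomes in groups.items()
}

# Combine: append each group's result back onto the original records
combined = [
    (country, income, median_by_country[country])
    for country, income in records
]
```

Here `median_by_country` holds one value per group (the summarized form), while `combined` keeps every original row and adds the group-level result to it (the "appended to the same data" form).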

Tools for streamlining split-apply-combine operations are available for popular statistical computation environments (non-exhaustive list):

In the R statistical environment there are dedicated packages for this purpose:
- data.table is an extension of data.frame that is optimized for split-apply-combine operations among other things
- dplyr and the original package plyr provide convenient syntax and fast processing for such manipulations
In Python, the pandas library introduces data objects that include a group-by method for this type of operation.
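
As a rough illustration of the pandas group-by approach mentioned above, the sketch below computes the highest score per class. The `scores` DataFrame and its column names are hypothetical; `groupby`, `max`, and `transform` are standard pandas methods.

```python
import pandas as pd

# Hypothetical student scores (illustrative data only)
scores = pd.DataFrame({
    "class": ["A", "A", "B", "B", "B"],
    "student": ["s1", "s2", "s3", "s4", "s5"],
    "score": [71, 88, 64, 93, 80],
})

# Split on "class", apply max to each group, combine into one Series
# indexed by class
highest = scores.groupby("class")["score"].max()

# transform returns a result aligned with the original rows, so the
# per-group value can be appended to the same data
scores["class_max"] = scores.groupby("class")["score"].transform("max")
```

The first form returns one row per group; the `transform` form keeps the original shape, which is useful when the group statistic is needed alongside each observation.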

151 questions

2 answers

python pandas, DF.groupby().agg(), column reference in agg()

On a concrete problem, say I have a DataFrame DF word tag count 0 a S 30 1 the S 20 2 a T 60 3 an T 5 4 the T 10 I want to find, for every "word", the "tag" that has the most "count". So the…

asked Mar 10 '13 at 13:16

jf328

2 answers

ddply + summarize for repeating same statistical function across large number of columns

Ok, second R question in quick succession. My data: Timestamp St_01 St_02 ... 1 2008-02-08 00:00:00 26.020 25.840 ... 2 2008-02-08 00:10:00 25.985 25.790 ... 3 2008-02-08 00:20:00 25.930 25.765 ... 4 2008-02-08 00:30:00 25.925…

r multiple-columns plyr idioms split-apply-combine

asked May 28 '12 at 16:19

Reuben L.

1 answer

pandas: get all groupby values in an array

I'm sure this has been asked before, sorry if duplicate. Suppose I have the following dataframe: df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'], 'data': range(6)}, columns=['key', 'data']) >> key data 0 A 0 1 …

pandas pandas-groupby split-apply-combine

asked Mar 12 '19 at 15:51

ru111

2 answers

How to use dplyr to calculate a weighted mean of two grouped variables

I know this must be super easy, but I'm having trouble finding the right dplyr commands to do this. Let's say I want to group a dataset by two variables, and then summarize the count for each row. For this we simply have: mtcars %>% group_by(cyl,…

r dplyr weighted-average summarize split-apply-combine

asked Apr 24 '18 at 01:15

ds_guy

4 answers

R: split-apply-combine for geographic distance

I have downloaded a list of all the towns and cities etc in the US from the census bureau. Here is a random sample: dput(somewhere) structure(list(state = structure(c(30L, 31L, 5L, 31L, 24L, 36L, 13L, 21L, 6L, 10L, 31L, 28L, 10L, 5L, 5L, 8L, 23L,…

r list dataframe geocoding split-apply-combine

asked Nov 10 '21 at 15:19

jvalenti

1 answer

Using groupby with expanding and a custom function

I have a dataframe that consists of truthIds and trackIds: truthId = ['A', 'A', 'B', 'B', 'C', 'C', 'A', 'C', 'B', 'A', 'A', 'C', 'C'] trackId = [1, 1, 2, 2, 3, 4, 5, 3, 2, 1, 5, 4, 6] df1 = pd.DataFrame({'truthId': truthId, 'trackId': trackId}) …

python pandas lambda pandas-groupby split-apply-combine

asked Feb 06 '18 at 18:51

Tara S

2 answers

Find half of each group with Pandas GroupBy

I need to select half of a dataframe using the groupby, where the size of each group is unknown and may vary across groups. For example: index summary participant_id 0 130599 17.0 13 1 130601 18.0 …

python pandas pandas-groupby split-apply-combine

asked Jun 27 '17 at 19:42

Arnold Klein

2 answers

Programmatically calling group_by() on a varying variable

Using dplyr, I'd like to summarize [sic] by a variable that I can vary (e.g. in a loop or apply-style command). Typing the names in directly works fine: library(dplyr) ChickWeight %>% group_by( Chick, Diet ) %>% summarise( mw = mean( weight ) ) But…

r group-by dplyr split-apply-combine

asked Feb 08 '15 at 00:22

Ari B. Friedman

1 answer

Adding rows in `dplyr` output

In traditional plyr, returned rows are added automagically to the output even if they exceed the number of input rows for that grouping: set.seed(1) dat <- data.frame(x=runif(10),g=rep(letters[1:5],each=2)) > ddply( dat, .(g), function(df)…

r dplyr split-apply-combine

asked May 13 '14 at 01:10

Ari B. Friedman

2 answers

Select rows of a DataFrame containing minimum of grouping variable in Julia

I'm wondering if there is an efficient way to do the following in Julia: I have a DataFrame of the following form: julia> df1 = DataFrame(var1=["a","a","a","b","b","b","c","c","c"], var2=["p","q","r","p","p","r","q","p","p"], …

group-by julia minimum split-apply-combine

asked Nov 26 '20 at 15:29

Kayvon Coffey

1 answer

normalizing data by duplication

Note: this question is indeed a duplicate of Split pandas dataframe string entry to separate rows, but the answer provided here is more generic and informative, so, with all due respect, I chose not to delete the thread. I have a 'dataset' with the…

python pandas split-apply-combine

asked Aug 22 '16 at 11:23

Lorinc Nyitrai

3 answers

Pandas multiindex dataframe set first row in a column to 0

I am having some trouble working on grouped objects in pandas. Specifically, I want to be able to set the first row in a column to 0 while keeping other rows unchanged. For example: df = pd.DataFrame({'A': ['foo', 'bar', 'baz'] * 2, …

python pandas multi-index split-apply-combine

asked Aug 12 '14 at 16:44

Ganesh Sundar

3 answers

how to add key variables to `dplyr::group_map()`?

I have the following, but want to add the group_by() key Species to the resulting tibble: MWE iris %>% group_by(Species) %>% group_map(~ broom::tidy(lm(Sepal.Length ~ Sepal.Width, data = .x))) %>% bind_rows() Output # How do I add the…

r dplyr split-apply-combine

asked Sep 06 '21 at 00:10

mkk

1 answer

purrr split %>% map %>% bind VERSUS dplyr group_by %>% do

I am often in the position of wanting to split-apply-combine regression models. I've found two ways of doing it, the "purrr" approach and the "dplyr::do()" approach. Issue with the purrr approach: I want columns in the resulting data.frame to…

r group-by dplyr purrr split-apply-combine

asked May 05 '19 at 19:06

Alex Coppock

2 answers

MATLAB: return both arguments from ISMEMBER when used inside SPLITAPPLY

How can I access both arguments of ismember when it is used inside splitapply? splitapply only returns scalar values for each group, so in order to compute nonscalar values for each group (as returned by the first argument of ismember), one has to…

matlab cell-array split-apply-combine

asked Nov 06 '17 at 11:02

Confounded
