
I have a DataFrame with a column 'Sales'.

How can I split it into two DataFrames based on the 'Sales' value?

The first DataFrame should have the rows with 'Sales' < s and the second the rows with 'Sales' >= s.

jezrael
146 percent Russian

5 Answers


You can use boolean indexing:

df = pd.DataFrame({'Sales':[10,20,30,40,50], 'A':[3,4,7,6,1]})
print (df)
   A  Sales
0  3     10
1  4     20
2  7     30
3  6     40
4  1     50

s = 30

df1 = df[df['Sales'] >= s]
print (df1)
   A  Sales
2  7     30
3  6     40
4  1     50

df2 = df[df['Sales'] < s]
print (df2)
   A  Sales
0  3     10
1  4     20

It's also possible to invert the mask with ~:

mask = df['Sales'] >= s
df1 = df[mask]
df2 = df[~mask]
print (df1)
   A  Sales
2  7     30
3  6     40
4  1     50

print (df2)
   A  Sales
0  3     10
1  4     20

print (mask)
0    False
1    False
2     True
3     True
4     True
Name: Sales, dtype: bool

print (~mask)
0     True
1     True
2    False
3    False
4    False
Name: Sales, dtype: bool
jpobst
jezrael
  • Is there a way to do it without having to slice the dataframe twice? Because this way we'll have to roll over the index once to create df1, and another time for the exact same condition for df2. But I can't figure out how to get both dataframes in a single line. – ysearka Jun 07 '16 at 11:55
  • Unfortunately, I think there is only this solution - see the [cookbook](http://pandas.pydata.org/pandas-docs/stable/cookbook.html#splitting). – jezrael Jun 07 '16 at 12:04
  • What's the performance difference between using `mask` vs traditional slicing? My tests show the mask is a bit faster, but not a huge difference. – 3pitt Mar 27 '18 at 19:20
  • @Mike Palmice Hmm, comparing the first approach with the second, performance is the same; it's up to you which method you find nicer. The second should be better if you need to combine multiple masks or reuse a mask, and for long, complicated masks it should also be more readable. – jezrael Mar 27 '18 at 19:27
  • Can this be implemented for each single value in the sales column? Let's say I had a Categorical column with values A, B, and C. Is there a way to split the df into three without actually specifying the values to do so, but just saying "create a new df for each unique value of the column". Thanks! – Odisseo Sep 13 '19 at 04:39
  • Not exactly, but I figured it out by doing a for loop: iterating through each unique column value, then splitting the df by slicing on that value. Not too hard actually, I don't even know why I asked. Thanks though. – Odisseo Sep 13 '19 at 05:58
  • @Odisseo - I'm not 100% sure I understand; what is the expected output? Two DataFrames by mask for each group? Then the best approach should be to first filter like in the answer above and then split by `Categorical` like in the cookbook link in the comment above. – jezrael Sep 13 '19 at 06:09
  • Yeah don’t worry about it. It’s harder to explain in this box than to actually code it. I would add my code but I don’t think you’re supposed to in the comments. – Odisseo Sep 13 '19 at 06:13
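
A minimal sketch of the per-unique-value split Odisseo describes in the comments above, assuming a hypothetical 'Category' column (not part of the original question). Iterating over groupby yields one (key, group) pair per unique value, so no values need to be listed explicitly:

import pandas as pd

df = pd.DataFrame({'Category': ['A', 'B', 'A', 'C', 'B'],
                   'Sales': [10, 20, 30, 40, 50]})

# one DataFrame per unique value, keyed by that value
dfs_by_value = {key: group for key, group in df.groupby('Category')}

print (dfs_by_value['A'])
  Category  Sales
0        A     10
2        A     30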

Using groupby, you could split into two DataFrames like this:

In [1047]: df1, df2 = [x for _, x in df.groupby(df['Sales'] < 30)]

In [1048]: df1
Out[1048]:
   A  Sales
2  7     30
3  6     40
4  1     50

In [1049]: df2
Out[1049]:
   A  Sales
0  3     10
1  4     20
Zero
  • This operation appears to be substantially more expensive than jezrael's two options, though syntactically more elegant imo. – 3pitt Mar 27 '18 at 19:23
  • This is not fully equivalent to jezrael's options. If after the split one of the data sets is empty, then groupby will return a list with just one element and it will fail to unpack into df1 and df2. – hgrey Jun 06 '19 at 09:06
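
A small defensive sketch of hgrey's point (my illustration, not from the original answer): building a dict of the groups and falling back to an empty frame keeps the split working even when one side has no rows.

import pandas as pd

df = pd.DataFrame({'Sales':[10,20,30,40,50], 'A':[3,4,7,6,1]})

groups = dict(tuple(df.groupby(df['Sales'] < 30)))
# .get falls back to an empty DataFrame with the same columns if a side is missing
df1 = groups.get(False, df.iloc[0:0])   # 'Sales' >= 30
df2 = groups.get(True, df.iloc[0:0])    # 'Sales' < 30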

Using groupby and a list comprehension:

Store all the split DataFrames in a list variable and access each of the separated DataFrames by its index.

DF = pd.DataFrame({'chr':["chr3","chr3","chr7","chr6","chr1"],'pos':[10,20,30,40,50],})
ans = [y for x, y in DF.groupby('chr')]

Access the separated DataFrames like this:

ans[0]
ans[1]
ans[-1] # this is the last separated DF

Access a column of a separated DataFrame like this:

ans_i_chr = ans[i].chr  # the 'chr' column of the i-th separated DataFrame
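
If you only need one specific group rather than all of them, get_group fetches it by key directly (a small sketch, not part of the original answer):

chr3_df = DF.groupby('chr').get_group('chr3')   # just the rows where chr == 'chr3'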
ZachB
keryruo

One-liner using the walrus operator (Python 3.8):

df1, df2 = df[(mask:=df['Sales'] >= 30)], df[~mask]

Consider using copy to avoid SettingWithCopyWarning:

df1, df2 = df[(mask:=df['Sales'] >= 30)].copy(), df[~mask].copy()

Alternatively, you can use the query method:

df1, df2 = df.query('Sales >= 30').copy(), df.query('Sales < 30').copy()
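
If the threshold lives in a Python variable, query can reference it with the @ prefix (a small sketch assuming a variable named s, as in the question; not part of the original answer):

s = 30
df1, df2 = df.query('Sales >= @s').copy(), df.query('Sales < @s').copy()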
Mykola Zotko

I like to use this for speeding up searches or rolling-average `.apply(lambda x: ...)` type functions, so I split big files into dictionaries of DataFrames:

df_dict = {sale_v: df[df['Sales'] == sale_v] for sale_v in df.Sales.unique()}

This should do it if you want to split based on categorical groups.
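
A short usage sketch for the dictionary above, reusing the sample data from the accepted answer (my own illustration, not part of the original answer):

import pandas as pd

df = pd.DataFrame({'Sales':[10,20,30,40,50], 'A':[3,4,7,6,1]})
df_dict = {sale_v: df[df['Sales'] == sale_v] for sale_v in df.Sales.unique()}

print (df_dict[30])   # the single row where Sales == 30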

Art
Ryan L