
I have a DataFrame with a column 'Sales'.

How can I split it into two DataFrames based on the 'Sales' value?

The first DataFrame should have the rows with 'Sales' < s and the second the rows with 'Sales' >= s.

jezrael
146 percent Russian

5 Answers


You can use boolean indexing:

df = pd.DataFrame({'Sales':[10,20,30,40,50], 'A':[3,4,7,6,1]})
print (df)
   A  Sales
0  3     10
1  4     20
2  7     30
3  6     40
4  1     50

s = 30

df1 = df[df['Sales'] >= s]
print (df1)
   A  Sales
2  7     30
3  6     40
4  1     50

df2 = df[df['Sales'] < s]
print (df2)
   A  Sales
0  3     10
1  4     20

It's also possible to invert the mask with ~:

mask = df['Sales'] >= s
df1 = df[mask]
df2 = df[~mask]
print (df1)
   A  Sales
2  7     30
3  6     40
4  1     50

print (df2)
   A  Sales
0  3     10
1  4     20

print (mask)
0    False
1    False
2     True
3     True
4     True
Name: Sales, dtype: bool

print (~mask)
0     True
1     True
2    False
3    False
4    False
Name: Sales, dtype: bool
jpobst
jezrael
  • Is there a way to do it without having to slice the dataframe twice? Because this way we'll have to roll over the index once to create df1, and another time for the exact same condition for df2. But I can't figure out how to get both dataframes in a single line. – ysearka Jun 07 '16 at 11:55
  • Unfortunately, I think there is only this solution - see the [cookbook](http://pandas.pydata.org/pandas-docs/stable/cookbook.html#splitting). – jezrael Jun 07 '16 at 12:04
  • What's the performance difference between using `mask` vs traditional slicing? My tests show the mask is a bit faster, but not a huge difference. – 3pitt Mar 27 '18 at 19:20
  • @Mike Palmice Hmm, comparing the first approach with the second, performance is the same; it's up to you which method you find nicer. The second should be better if you need to combine multiple masks or reuse a mask, and for long, complicated masks it should also be more readable. – jezrael Mar 27 '18 at 19:27
  • Can this be implemented for each single value in the sales column? Let's say I had a Categorical column with values A, B, and C. Is there a way to split the df into three without actually specifying the values to do so, but just saying "create a new df for each unique value of the column". Thanks! – Odisseo Sep 13 '19 at 04:39
  • Not exactly, but I figured it out by doing a for loop: iterating through each unique column value, then splitting the df by slicing on that value. Not too hard actually, I don't even know why I asked. Thanks though. – Odisseo Sep 13 '19 at 05:58
  • @Odisseo - I'm not 100% sure I understand; what is the expected output? Two DataFrames by mask for each group? Then the best approach should be to first filter like in the answer above and then split by `Categorical` like in the cookbook link in the comment above. – jezrael Sep 13 '19 at 06:09
  • Yeah don’t worry about it. It’s harder to explain in this box than to actually code it. I would add my code but I don’t think you’re supposed to in the comments. – Odisseo Sep 13 '19 at 06:13
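
A minimal sketch of the per-unique-value split Odisseo describes in the comments above, assuming a hypothetical 'Category' column (not part of the original question). Iterating over groupby yields one (key, group) pair per unique value, so no values need to be listed explicitly:

import pandas as pd

df = pd.DataFrame({'Category': ['A', 'B', 'A', 'C', 'B'],
                   'Sales': [10, 20, 30, 40, 50]})

# one DataFrame per unique value, keyed by that value
dfs_by_value = {key: group for key, group in df.groupby('Category')}

print (dfs_by_value['A'])
  Category  Sales
0        A     10
2        A     30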

Using groupby, you could split into two DataFrames like this:

In [1047]: df1, df2 = [x for _, x in df.groupby(df['Sales'] < 30)]

In [1048]: df1
Out[1048]:
   A  Sales
2  7     30
3  6     40
4  1     50

In [1049]: df2
Out[1049]:
   A  Sales
0  3     10
1  4     20
Zero
  • This operation appears to be substantially more expensive than jezrael's two options, though syntactically more elegant imo. – 3pitt Mar 27 '18 at 19:23
  • This is not fully equivalent to jezrael's options. If after the split one of the data sets is empty, then groupby will return a list with just one element and it will fail to unpack into df1 and df2. – hgrey Jun 06 '19 at 09:06
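
A small defensive sketch of hgrey's point (my illustration, not from the original answer): building a dict of the groups and falling back to an empty frame keeps the split working even when one side has no rows.

import pandas as pd

df = pd.DataFrame({'Sales':[10,20,30,40,50], 'A':[3,4,7,6,1]})

groups = dict(tuple(df.groupby(df['Sales'] < 30)))
# .get falls back to an empty DataFrame with the same columns if a side is missing
df1 = groups.get(False, df.iloc[0:0])   # 'Sales' >= 30
df2 = groups.get(True, df.iloc[0:0])    # 'Sales' < 30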

Using groupby and a list comprehension:

Store all the split DataFrames in a list variable and access each of the separated DataFrames by its index.

DF = pd.DataFrame({'chr':["chr3","chr3","chr7","chr6","chr1"],'pos':[10,20,30,40,50],})
ans = [y for x, y in DF.groupby('chr')]

Access the separated DataFrames like this:

ans[0]
ans[1]
ans[-1] # this is the last separated DF

Access a column of a separated DataFrame like this:

ans_i_chr = ans[i].chr  # the 'chr' column of the i-th separated DataFrame
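
If you only need one specific group rather than all of them, get_group fetches it by key directly (a small sketch, not part of the original answer):

chr3_df = DF.groupby('chr').get_group('chr3')   # just the rows where chr == 'chr3'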
ZachB
keryruo

One-liner using the walrus operator (Python 3.8):

df1, df2 = df[(mask:=df['Sales'] >= 30)], df[~mask]

Consider using copy to avoid SettingWithCopyWarning:

df1, df2 = df[(mask:=df['Sales'] >= 30)].copy(), df[~mask].copy()

Alternatively, you can use the query method:

df1, df2 = df.query('Sales >= 30').copy(), df.query('Sales < 30').copy()
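
If the threshold lives in a Python variable, query can reference it with the @ prefix (a small sketch assuming a variable named s, as in the question; not part of the original answer):

s = 30
df1, df2 = df.query('Sales >= @s').copy(), df.query('Sales < @s').copy()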
Mykola Zotko

I like to use this for speeding up searches or rolling-average `.apply(lambda x: ...)` type functions, so I split big files into dictionaries of DataFrames:

df_dict = {sale_v: df[df['Sales'] == sale_v] for sale_v in df.Sales.unique()}

This should do it if you want to split based on categorical groups.
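
A short usage sketch for the dictionary above, reusing the sample data from the accepted answer (my own illustration, not part of the original answer):

import pandas as pd

df = pd.DataFrame({'Sales':[10,20,30,40,50], 'A':[3,4,7,6,1]})
df_dict = {sale_v: df[df['Sales'] == sale_v] for sale_v in df.Sales.unique()}

print (df_dict[30])   # the single row where Sales == 30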

Art
Ryan L