1

I have a DataFrame with columnA with integer values that range between -3 and 89. I want to select all the rows with values in columnA between discrete 10 unit bins, e.g.

-10 to 0
0 to 10 ...
80 to 90

I can produce a list of the number of rows in each bin like this:

pd.cut(DataFrame['columnA'], np.arange(-10, 100, 10), include_lowest=True, labels=False).value_counts().sort_index().to_list()

resulting in a list like this:

[505, 25000, 21, 393, 79232, 953000, 24121, 662, 50, 900]

Now if I want to examine all the rows in the first bin, I can select them like this:

DataFrame.sort_values('columnA', ascending=True).iloc[0:505]

How can I write a function to select all the rows in bin N?

Cactus Philosopher
  • 804
  • 2
  • 12
  • 25

2 Answers2

1

How about this? I am assuming that your bins are of form [a, b[ (closed on the lower bound and open on the upper one.

df = pd.DataFrame(map(lambda x: int(x), (np.random.uniform(-3,89,1000))), columns=['ColumnA'])

def bin_func(df, N, col='ColumnA', xmin=-10, xmax=100, Nbins=10): 
    df_sorted = df.sort_values(col, ascending=True) 
    bins = np.arange(xmin, xmax, Nbins)
    if N > Nbins-1:
        return(pd.DataFrame())
    return(df_sorted[(df_sorted[col] >= bins[N]) & (df_sorted[col]<bins[N+1])])
b3rt0
  • 769
  • 2
  • 6
  • 21
  • 1
    I just edited it so that the parameters are now part of the function, in case you want to change the bins and ranges later on. – b3rt0 Oct 15 '19 at 03:38
  • Sure, it is a way to filter in pandas, select all the rows which satisfy the condition. Check this answer here https://stackoverflow.com/questions/17071871/select-rows-from-a-dataframe-based-on-values-in-a-column-in-pandas – b3rt0 Oct 15 '19 at 03:45
0

use DataFrame.groupby + Series.cumsum:

def get_bin(n):
     return df.groupby(df.index.isin([505, 25000, 21, 393, 79232, 953000, 24121, 662, 50, 900]).cumsum()).get_group(n)

this will divide your dataframe into groups called 0,1,2,3,4,5 ... n.You select using groupby.get_group.

Here is an example:

print(df)
   a  b  c  d
0  0  1  1  x
1  0  1  5  y
2  0  1  5  x
3  0  1  0  y
4  1  1  5  x
5  0  1  4  y
6  1  0  1  x
7  1  1  3  y
8  0  1  2  x
9  0  0  0  y

groups=df.index.isin([2,4,7,9]).cumsum()
print(groups)
#array([0, 0, 1, 1, 2, 2, 2, 3, 3, 4])

print(df.groupby(groups).get_group(0))
   a  b  c  d
0  0  1  1  x
1  0  1  5  y
ansev
  • 30,322
  • 5
  • 17
  • 31
  • `def get_bin(df, n):` when n is 0, keyerror is 0. When n is 1, all rows are returned. When n is 2, keyerror is 2. – Cactus Philosopher Oct 15 '19 at 03:17
  • I said groups of 1 onwards, if you want from n you must delete the + 1 – ansev Oct 15 '19 at 03:22
  • I have created an example to make the application clear. I may not have understood you well, excuse me. Could you show an example with the expected departure and thus be able to help you? – ansev Oct 15 '19 at 03:30
  • I have removed + 1 so that I can select from bin 0 – ansev Oct 15 '19 at 03:34