I have a DataFrame containing 2 columns, x and y, that represent coordinates in a Cartesian system. I want to obtain groups with an even (or almost even) number of points. I was thinking about using pd.qcut(), but as far as I can tell it can only be applied to a single column.

For example, I would like to divide the whole set of points into 4 intervals in x and 4 intervals in y (the two numbers don't have to be equal) so that I would have a roughly even number of points in each group. I expect to see 16 groups in total (4x4).

I tried a very direct approach which obviously didn't produce the right result (look at the groups with 51 and 99 points, for example). Here is the code:

df['x_bin'] = pd.qcut(df.x, 4)
df['y_bin'] = pd.qcut(df.y, 4)
grouped = df.groupby([df.x_bin, df.y_bin]).count()
print(grouped)

The output:

x_bin                      y_bin                                 
(7.976999999999999, 7.984] (-219.17600000000002, -219.17]  51  51
                           (-219.17, -219.167]             60  60
                           (-219.167, -219.16]             64  64
                           (-219.16, -219.154]             99  99
(7.984, 7.986]             (-219.17600000000002, -219.17]  76  76
                           (-219.17, -219.167]             81  81
                           (-219.167, -219.16]             63  63
                           (-219.16, -219.154]             53  53
(7.986, 7.989]             (-219.17600000000002, -219.17]  78  78
                           (-219.17, -219.167]             77  77
                           (-219.167, -219.16]             68  68
                           (-219.16, -219.154]             51  51
(7.989, 7.993]             (-219.17600000000002, -219.17]  70  70
                           (-219.17, -219.167]             55  55
                           (-219.167, -219.16]             77  77
                           (-219.16, -219.154]             71  71

Am I wrong in thinking this is possible to do with pandas only, or am I missing something else?

Alex.Kh
  • See how to provide an example df for others to use https://stackoverflow.com/q/20109391/6692898 – RichieV Jul 30 '20 at 15:55

1 Answer

The problem is that the distribution of the rows along x is not necessarily the same as the distribution along y.

You are empirically mimicking a correlation analysis and finding out that there is a slight negative relation... the y values are higher at the lower end of the x scale and rather flat at the higher end of x.

So, if you want an even number of data points in each bin, I would suggest splitting the df into x bins and then applying qcut on y within each x bin (so the y bins have different cut points but an even sample size).


Edit

Something like:

split_df = [(xbin, xdf) for xbin, xdf in df.groupby(pd.qcut(df.x, 4))]  # no aggregation so far, just splitting the df evenly on x

split_df = [(xbin, xdf.groupby(pd.qcut(xdf.y, 4)).x.size())
            for xbin, xdf in split_df]  # now each xdf is evenly cut on y
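
As a quick sanity check (just a sketch reusing the names from the snippet above), you can print the per-bin sample sizes:

for xbin, ycounts in split_df:
    print(xbin)      # the x interval
    print(ycounts)   # number of rows in each of the 4 y bins within that x interval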

Now you need to work on each xdf separately. Attempting to concatenate all the xdfs will result in an error: the index of each xdf is a CategoricalIndex, and the first xdf would need to have all categories for concat to work (i.e. split_df[0][1].index must include the bins of all the other xdfs). Alternatively, you could change the index to the center of each interval as a float64, for both the x bins and the y bins.
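
A minimal sketch of that last option, assuming split_df holds the (xbin, ycounts) pairs built above; pd.Interval objects expose a .mid attribute, so the categorical bins can be replaced by their float midpoints before concatenating:

import pandas as pd

# replace each CategoricalIndex of y intervals with the interval midpoints (float64)
pieces = []
for xbin, ycounts in split_df:
    ycounts.index = [interval.mid for interval in ycounts.index]
    pieces.append(ycounts)

# stack everything into one Series indexed by (x midpoint, y midpoint)
counts = pd.concat(pieces, keys=[xb.mid for xb, _ in split_df],
                   names=['x_mid', 'y_mid'])
print(counts)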

RichieV
  • Thanks for the answer. When you are talking about applying `qcut()` for the 2nd time, do you mean that the bins will be taken from the result of `qcut()` on x, or something else? Also, my main point was actually not to have the same cut points (or intervals) but rather that the sample size would be approximately the same for each group – Alex.Kh Jul 30 '20 at 02:22
  • I did manage to use your solution to split the data evenly as I wanted. As a last question, can I assume that the position of the samples is the same as in the original data? E.g., I have 1094 points in total. With your method I get groups of 69 samples each (16 groups in total). Does it mean that the first 69 records in the original DataFrame are in the first group, the second 69 in the second, etc. (the order is maintained)? – Alex.Kh Jul 30 '20 at 15:39
  • Not at all, a point in the first bin (lowest x, lowest y) might happen to be the last in your `df`. However, within each xbin/ybin, the order of the rows will be the same as you would get if you filtered the full `df` by xbin/ybin https://pandas.pydata.org/docs/user_guide/groupby.html#groupby-sorting – RichieV Jul 30 '20 at 15:49
  • If this fixes your problem then you can accept the answer https://stackoverflow.com/help/someone-answers... or wait for a better answer – RichieV Jul 30 '20 at 15:53
  • Sure, I wanted to do that anyway. Thanks again – Alex.Kh Jul 30 '20 at 16:05