So I have two sets of features that I wish to bin (classify) and then combine to create a new feature. It is not unlike classifying coordinates into grids on a map.

The issue is that the features are not evenly distributed and I would like to use quantiles when binning (like with pandas.qcut()) on both features/coordinates.

Is there a better way than doing qcut() on both features and then concatenating the result labels?

Reuben L.

1 Answer

Create a cartesian product categorical.

Consider the dataframe df (with numpy imported as np and pandas as pd)

df = pd.DataFrame(dict(A=np.random.rand(20), B=np.random.rand(20)))

           A         B
0   0.538186  0.038985
1   0.185523  0.438329
2   0.652151  0.067359
3   0.746060  0.774688
4   0.373741  0.009526
5   0.603536  0.149733
6   0.775801  0.585309
7   0.091238  0.811828
8   0.504035  0.639003
9   0.671320  0.132974
10  0.619939  0.883372
11  0.301644  0.882258
12  0.956463  0.391942
13  0.702457  0.099619
14  0.367810  0.071612
15  0.454935  0.651631
16  0.882029  0.015642
17  0.880251  0.348386
18  0.496250  0.606346
19  0.805688  0.401578

We can create new categoricals with pd.qcut

d1 = df.assign(
    A_cut=pd.qcut(df.A, 2, labels=[1, 2]),
    B_cut=pd.qcut(df.B, 2, labels=list('ab'))
)

           A         B A_cut B_cut
0   0.538186  0.038985     1     a
1   0.185523  0.438329     1     b
2   0.652151  0.067359     2     a
3   0.746060  0.774688     2     b
4   0.373741  0.009526     1     a
5   0.603536  0.149733     1     a
6   0.775801  0.585309     2     b
7   0.091238  0.811828     1     b
8   0.504035  0.639003     1     b
9   0.671320  0.132974     2     a
10  0.619939  0.883372     2     b
11  0.301644  0.882258     1     b
12  0.956463  0.391942     2     a
13  0.702457  0.099619     2     a
14  0.367810  0.071612     1     a
15  0.454935  0.651631     1     b
16  0.882029  0.015642     2     a
17  0.880251  0.348386     2     a
18  0.496250  0.606346     1     b
19  0.805688  0.401578     2     b

You can create the cartesian product categorical with tuples

d2 = d1.assign(cartesian=pd.Categorical(d1.filter(regex='_cut').apply(tuple, axis=1)))
print(d2)

           A         B A_cut B_cut cartesian
0   0.538186  0.038985     1     a    (1, a)
1   0.185523  0.438329     1     b    (1, b)
2   0.652151  0.067359     2     a    (2, a)
3   0.746060  0.774688     2     b    (2, b)
4   0.373741  0.009526     1     a    (1, a)
5   0.603536  0.149733     1     a    (1, a)
6   0.775801  0.585309     2     b    (2, b)
7   0.091238  0.811828     1     b    (1, b)
8   0.504035  0.639003     1     b    (1, b)
9   0.671320  0.132974     2     a    (2, a)
10  0.619939  0.883372     2     b    (2, b)
11  0.301644  0.882258     1     b    (1, b)
12  0.956463  0.391942     2     a    (2, a)
13  0.702457  0.099619     2     a    (2, a)
14  0.367810  0.071612     1     a    (1, a)
15  0.454935  0.651631     1     b    (1, b)
16  0.882029  0.015642     2     a    (2, a)
17  0.880251  0.348386     2     a    (2, a)
18  0.496250  0.606346     1     b    (1, b)
19  0.805688  0.401578     2     b    (2, b)
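If a single integer per grid cell is more convenient than a tuple (for example as a compact model feature), one possible variation is to ask qcut for integer bin indices and combine them arithmetically. This is only a sketch, not part of the approach above: the small hand-picked frame, the names n_a, n_b, a_bin, b_bin and the "cell" column are all assumptions made for illustration.

```python
import pandas as pd

# Small hand-picked frame so the bins are easy to check by eye (hypothetical data).
df = pd.DataFrame({"A": [0.1, 0.2, 0.8, 0.9], "B": [0.3, 0.7, 0.2, 0.6]})

n_a, n_b = 2, 2  # number of quantile bins per feature (assumed values)

# labels=False makes qcut return integer bin indices 0..n-1 instead of labels.
a_bin = pd.qcut(df["A"], n_a, labels=False)
b_bin = pd.qcut(df["B"], n_b, labels=False)

# Row-major cell id: every (a_bin, b_bin) pair maps to a distinct integer.
df["cell"] = a_bin * n_b + b_bin
print(df)
```

The pair can be recovered with divmod(cell, n_b), so no information is lost compared with the tuple form.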

If you were so inclined, you could even declare an ordering for the cartesian categories.
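A minimal sketch of what declaring that ordering might look like, assuming a lexicographic order on the (A_cut, B_cut) tuples is what you want. The four-row frame is hypothetical data chosen so the bins are easy to verify, and building the tuples with zip is just one way to do it.

```python
import pandas as pd

# Hypothetical data, deliberately not pre-sorted.
df = pd.DataFrame({"A": [0.9, 0.1, 0.8, 0.2], "B": [0.6, 0.3, 0.2, 0.7]})

a_cut = pd.qcut(df["A"], 2, labels=[1, 2])
b_cut = pd.qcut(df["B"], 2, labels=list("ab"))

# One tuple per row, kept in a Series so the original index is preserved.
pairs = pd.Series(list(zip(a_cut, b_cut)), index=df.index)

# With no explicit categories, pandas sorts the unique tuples, and
# ordered=True declares that sorted sequence as the category order.
cartesian = pd.Categorical(pairs, ordered=True)

d2 = df.assign(A_cut=a_cut, B_cut=b_cut, cartesian=cartesian)

# Sorting by the column now follows the declared category order.
print(d2.sort_values("cartesian"))
```

For an order other than the sorted one, pd.Categorical also accepts an explicit categories list alongside ordered=True.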

piRSquared
  • Looks great. Thanks! Will accept this answer within a reasonable timeframe if no other challengers appear. – Reuben L. Apr 15 '17 at 06:42
  • Thanks! I'm using now: `df["bucket"] = list(zip(pd.qcut(df["A"], 10, labels=list(range(10))), pd.qcut(df["B"], 10, labels=list(range(10)))))` based on your answer. – AxelWass Mar 25 '22 at 14:55