I want to segment a dataset of items (identified by IDs) that have several categorical features, each taking different values (for instance, color takes 'Blue', 'Orange', 'Green'; size takes 'S', 'M', 'L'; brand takes 'Brand 1', 'Brand 2', etc.):
ID | Brand | Color | Size | Price |
---|---|---|---|---|
1 | Brand 1 | Orange | S | 23 |
2 | Brand 2 | Blue | XXL | 3 |
3 | Brand 1 | Green | XXXL | 45 |
4 | Brand 2 | Blue | M | 200 |
I can easily do this by hand for 1 or 2 features (with a small number of values). E.g., if I segment by brand, I get:
ID | Brand | Color | Size | Price |
---|---|---|---|---|
1 | Brand 1 | Orange | S | 23 |
3 | Brand 1 | Green | XXXL | 45 |
and
ID | Brand | Color | Size | Price |
---|---|---|---|---|
2 | Brand 2 | Blue | XXL | 3 |
4 | Brand 2 | Blue | M | 200 |
Unfortunately, some features take 10+ values. Moreover, the number of subsets explodes if I segment by more than one feature. I want to test different levels of segmentation (e.g. color + brand, color + brand + size), which is why I don't want to do it by hand.
I am trying to write a function that takes the dataframe and a list of features as input and outputs all the different subsets, but so far my code is worthless.
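A minimal sketch of such a function, assuming the data is a pandas DataFrame. The name `segment` and the choice to return a dict keyed by tuples of feature values are hypothetical, not something I already have working. It relies on `DataFrame.groupby`, which only produces the combinations that actually occur in the data:

```python
import pandas as pd


def segment(df, features):
    """Map each observed combination of feature values to its sub-DataFrame.

    `segment` is a hypothetical helper name; `features` is a list of column names.
    """
    subsets = {}
    # sort=False keeps the groups in order of first appearance
    for key, group in df.groupby(features, sort=False):
        # Depending on the pandas version, a single-column groupby may yield
        # scalar keys; normalize so keys are always tuples.
        if not isinstance(key, tuple):
            key = (key,)
        subsets[key] = group
    return subsets


# The example data from the tables above
df = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Brand": ["Brand 1", "Brand 2", "Brand 1", "Brand 2"],
    "Color": ["Orange", "Blue", "Green", "Blue"],
    "Size": ["S", "XXL", "XXXL", "M"],
    "Price": [23, 3, 45, 200],
})

by_brand = segment(df, ["Brand"])
by_brand_color = segment(df, ["Brand", "Color"])

for key, subset in by_brand_color.items():
    print(key)
    print(subset, end="\n\n")
```

With this, testing another level of segmentation would only mean passing a different list, e.g. `segment(df, ["Color", "Brand", "Size"])`.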
Thank you in advance if you think you can help me!