I am trying to calculate the total number of unique interactions that exist between the categorical features in a dataset.

Assume a small dataframe:

           Fruit          Vegetable       Animal
---------------------------------------------------
0          Apple          Carrot          Rabbit
1          Apple          Lemon           Fish
2          Banana         Cucumber        Cat
3          Orange         Lemon           Fish
4          Melon          Lettuce         Cat
5          Mango          Lemon           Fish
---------------------------------------------------

How do I calculate the total number of unique pairwise interactions between the features? The Fruit column has 5 unique categories, the Vegetable column has 4 and the Animal column has 3. So, if I am not mistaken, the number of possible combinations across all three columns is 5 x 4 x 3 = 60. However, I would like to count only the pairwise combinations that actually exist in the given dataset.

So for example, Apple-Carrot is one, Carrot-Rabbit is another. Lemon-Fish also counts as one, despite appearing three times in the dataset.

Pleastry
  • Does this answer your question? [pandas unique values multiple columns](https://stackoverflow.com/questions/26977076/pandas-unique-values-multiple-columns) – Latra Jan 11 '22 at 16:41
  • are you looking for all unique tuple combinations between (fruit and vegetable) and (fruit and animal) and (vegetable and animal)? – Golden Lion Jan 11 '22 at 16:56
  • @Golden Lion yes exactly. I only need to count the number of unique tuples from my data – Pleastry Jan 11 '22 at 18:19
  • have you tried cross product ? after taking the cross product you can apply unique on the tuple resultset – Golden Lion Jan 11 '22 at 18:32

2 Answers

You can do that by first finding all possible pairs of columns, and then all possible combinations of the unique elements within each pair of columns:

import itertools

import pandas as pd

# df is assumed to be the dataframe from the question
# find unique elements per column
unique_elements_per_column = {col: df[col].unique() for col in df.columns}
# create possible category combinations
category_pairs = list(itertools.combinations(df.columns, 2))
# find all possible combinations by category pair
all_possible_combinations = [
    list(itertools.product(unique_elements_per_column[a], unique_elements_per_column[b]))
    for a, b in category_pairs
]
# number of possible combinations per category pair
sum_combinations = [len(pairs) for pairs in all_possible_combinations]
df_out = pd.DataFrame(columns=['all_possible_combinations'], index=category_pairs, data=sum_combinations)


#output:
                         all_possible_combinations
    (Fruit, Vegetable)                          20
    (Fruit, Animal)                             15
    (Vegetable, Animal)                         12
Gabriele
  • I guess I didn't phrase my question properly. Your answer corresponds to the possible pairs that can be produced using the unique categories. Meanwhile, I am looking to count the unique pairs that exist in my data – Pleastry Jan 11 '22 at 18:25

This is my solution to my own problem. There might be another way to reach the same result though:


import pandas as pd

df = pd.DataFrame({"Fruit": ["Apple", "Apple", "Banana", "Orange", "Melon", "Mango"],
                    "Vegetable": ["Carrot", "Lemon", "Cucumber", "Lemon", "Lettuce", "Lemon"],
                    "Animal": ["Rabbit", "Fish", "Cat", "Fish", "Cat", "Fish"]})

columns = ["Fruit", "Vegetable", "Animal"]
unq_sum = 0
consumed = []

for col in columns:

    consumed.append(col)
    others = [x for x in columns if x not in consumed]
    for inner_col in others:
        unq_sum += df.groupby(col)[inner_col].nunique().sum()

print(unq_sum)
# output: 16

@Gabriele's answer finds the total number of possible unique pairs, which is 20 + 15 + 12 = 47 for this dataset. The pairs that actually exist in the data therefore cover 16/47 = 34.04% of all possible unique pairwise combinations.
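
As a cross-check, the same counts can probably be reached more directly by dropping duplicate rows for each pair of columns. This is only a sketch under the assumption that there are no missing values; the names `existing`, `possible` and `coverage` are made up for illustration, and with NaNs present `drop_duplicates` and `groupby(...).nunique()` may not agree.

```python
from itertools import combinations

import pandas as pd

df = pd.DataFrame({"Fruit": ["Apple", "Apple", "Banana", "Orange", "Melon", "Mango"],
                   "Vegetable": ["Carrot", "Lemon", "Cucumber", "Lemon", "Lettuce", "Lemon"],
                   "Animal": ["Rabbit", "Fish", "Cat", "Fish", "Cat", "Fish"]})

existing = {}  # unique pairs that actually occur in the data
possible = {}  # unique pairs that could occur given the categories

for a, b in combinations(df.columns, 2):
    # each distinct (a, b) row is one existing pairwise interaction
    existing[(a, b)] = len(df[[a, b]].drop_duplicates())
    # every category of a could in principle meet every category of b
    possible[(a, b)] = df[a].nunique() * df[b].nunique()

total_existing = sum(existing.values())
total_possible = sum(possible.values())
coverage = total_existing / total_possible

print(existing)
print(total_existing, total_possible, f"{coverage:.2%}")
```

Keeping the counts per column pair also shows where the coverage is low, rather than only giving the overall total.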

Pleastry