1

as so often I am confused by how dataframes are accessed. I have a dataframe (lets call it df) like this:

id_one id_two sets
1 123 {234, 345, 546, ...}
2 -234 {123, 234, 645, ...}
... ... ...

... 25 Million rows worth of this.

I want to filter the dataframe and only show rows where the set in "sets" intersects with another set, lets call it reference_set = {542345, 423, 64564, 435, ...etc}. Later I want to quantify this intersection, that's why I need the length of the intersection.

This does not work:

df.loc[
    len(
        df['sets'].intersection(reference_set) 
    ) > 0]

It gives "AttributeError: 'Series' object has no attribute 'intersection'"

Shouldn't it give a Boolean list to select on? Am I not following this correctly?

Thank you for advice!

bfalk
  • 33
  • 4

1 Answers1

0

You could use pandas apply to obtain the intersection between the reference set and the set on each row of the dataframe. Then, apply the len function over the newly created column (Intersect) to quantify each intersection. This will enable you to filter the dataframe and only show rows where the set in sets intersects with another set (df['Len'] > 0).

df used as input

   id_one  id_two                                                                                         sets
0    7575     527          {1, 4, 6, 7, 8, 13, 16, 20, 24, 31, 40, 47, 50, 52, 57, 61, 64, 69, 80, 88, 91, 96}
1     574    1555      {7, 18, 19, 22, 23, 24, 30, 39, 43, 47, 50, 58, 62, 64, 72, 76, 77, 83, 84, 86, 87, 96}
2    7831    8823  {5, 14, 15, 20, 23, 28, 30, 32, 35, 36, 40, 41, 44, 52, 54, 59, 60, 62, 63, 84, 87, 90, 96}
3     688    6860           {2, 9, 20, 24, 27, 28, 30, 38, 46, 57, 59, 60, 64, 65, 69, 71, 80, 84, 88, 91, 95}
4    8843     596    {6, 7, 8, 24, 25, 27, 30, 33, 47, 50, 54, 56, 57, 61, 64, 66, 69, 74, 78, 81, 85, 88, 99}
5    1269    7546             {11, 22, 24, 25, 33, 35, 45, 48, 49, 54, 57, 59, 61, 68, 70, 75, 86, 87, 94, 95}
6    1362    4860           {2, 5, 14, 19, 23, 32, 37, 38, 47, 48, 58, 62, 65, 68, 70, 72, 73, 77, 82, 88, 91}
7    7994    7192      {2, 3, 4, 7, 9, 11, 12, 13, 15, 17, 20, 24, 25, 29, 40, 50, 57, 64, 71, 78, 89, 95, 99}
8     748    6271  {1, 12, 19, 26, 30, 34, 45, 48, 52, 60, 67, 72, 73, 74, 76, 80, 82, 84, 89, 94, 96, 98, 99}
9     553    4068  {9, 12, 15, 20, 35, 39, 40, 41, 44, 45, 50, 57, 65, 67, 68, 69, 72, 73, 79, 87, 88, 97, 98}
REFERENCE_SET = {10, 18, 23, 55, 90, 92}

df['Intersect'] = df['sets'].apply(lambda row: REFERENCE_SET.intersection(row))
df['Len'] = df['Intersect'].apply(len)

df_filtered = df[df['Len'] > 0]

Output from df_filtered

   id_one  id_two                                                                                         sets Intersect  Len
1     574    1555      {7, 18, 19, 22, 23, 24, 30, 39, 43, 47, 50, 58, 62, 64, 72, 76, 77, 83, 84, 86, 87, 96}  {18, 23}    2
2    7831    8823  {5, 14, 15, 20, 23, 28, 30, 32, 35, 36, 40, 41, 44, 52, 54, 59, 60, 62, 63, 84, 87, 90, 96}  {90, 23}    2
6    1362    4860           {2, 5, 14, 19, 23, 32, 37, 38, 47, 48, 58, 62, 65, 68, 70, 72, 73, 77, 82, 88, 91}      {23}    1
n1colas.m
  • 3,863
  • 4
  • 15
  • 28