0

I'm working with three dataframes: df_1, df_2, and df_3.

They have different numbers of rows and columns, and different information. Each dataframe is indexed by Country name, so that is what connects them all.

The idea is to find the intersection of the three and determine how many unique elements are lost when we do that intersecting.

I start by calling those df's from the function they were created in:

df_1, df_2, df_3 = load_data()

merged_1 = pd.merge(df_1, df_2, how = 'inner', left_index = True, right_index = True)
merged_2 = pd.merge(merged_1, df_3, how = 'inner', left_index = True, right_index = True)

unique_df_1 = pd.merge(df_1, merged_2, how = 'left', left_index = True, right_index = True, 
                          indicator = True).query('_merge=="left_only"')

unique_df_2 = pd.merge(df_2, merged_2, how = 'left', left_index = True, right_index = True, 
                      indicator = True).query('_merge=="left_only"')

unique_df_3 = pd.merge(df_3, merged_2, how = 'left', left_index = True, right_index = True,
                          indicator = True).query('_merge=="left_only"')

return (len(unique_df_1)+len(unique_df_2)+len(unique_df_3))

This is my very first post on stack overflow, so I hope I did everything correctly. Apologies if I did not or if my writing was not clear.

Hefe
  • 421
  • 3
  • 23
  • Welcome to SO, you have quite the wall of text there, unfortunately you are missing a [mre] please provide one, there should at least be sample input (copy paste) and expected output. – Andreas May 19 '21 at 23:36
  • Hey Andreas, thank you! Will do. – Hefe May 19 '21 at 23:49

1 Answers1

1
df1 = pd.DataFrame({'z1':range(6), 'z2':[5,3,6,9,2,4]}, index=list('abcdef'))
df2 = pd.DataFrame({'x1':range(4), 'x2':[10,20,30, 40]}, index=list('abhi'))
df3 = pd.DataFrame({'y1':range(5), 'y2':[11,22,33, 44, 55]}, index=list('abktf'))

print(df1)

z1  z2
a   0   5
b   1   3
c   2   6
d   3   9
e   4   2
f   5   4

print(df2)

x1  x2
a   0  10
b   1  20
h   2  30
i   3  40

print(df3)

   y1  y2
a   0  11
b   1  22
k   2  33
t   3  44
f   4  55

df4 = pd.merge(df1, df2, left_index=True, right_index=True)
df5 = pd.merge(df4, df3, left_index=True, right_index=True)

print(df5)

   z1  z2  x1  x2  y1  y2
a   0   5   0  10   0  11
b   1   3   1  20   1  22

Are you trying to find the unique/missing countries? You can do it in a much simpler way:

set_1 = set(df1.index)
set_2 = set(df2.index)
set_3 = set(df3.index)

print(set_1)
print(set_2)
print(set_3)

>> {'d', 'b', 'e', 'a', 'f', 'c'}
>> {'a', 'i', 'b', 'h'}
>> {'b', 't', 'f', 'a', 'k'}

You can get the intersection between all three sets by:

set_4 = set_1.intersection(set_2).intersection(set_3)
print(set_4)
>> {'a', 'b'}

Missing countries from df1:

print(set_1.difference(set_4))
>> {'d', 'f', 'c', 'e'}

Let me know if this isn't what you're looking for. Or if there are any particular data-points you're trying to compare.

  • Hey Saksham. Thanks for your answer! That's definitely helpful, and thank you for interpreting what I wrote. You're solution got me the same answer as the method I tried, unfortunately! What I'm wondering is how to get everything but the purple section in this image: https://dhanapalan.com/wp-content/uploads/2017/03/Venn-diagram.png – Hefe May 20 '21 at 01:34
  • I figured it out. Some of the data in the excel file had superscript numbers, so when I removed those, everything added up correctly. Thanks for the help! – Hefe May 20 '21 at 02:45