1

I have two dataframes. df1 & df2. I want to filter df2 based on the values of df1 and then count the frequency of values and compare with df1 again.

Example:

df1:

Project_Number
S100
S100
S200
S300
S300
S300
S400
S400

df2:

Project_Number
S100
S200
S200
S300
S300
S300
S500

Now first filter the df2 based on the values of df1. Keep only those values which is present in df1 also.

df2_new:

Project_Number
S100
S200
S200
S300
S300
S300

Now take the frequency of both the dataframes-

df1['Count'] = df1['Project_Number'].map(df1['Project_Number'].value_counts())
df2_new['Count'] = df2_new['Project_Number'].map(df2_new['Project_Number'].value_counts())

df1-                              df2_new
Project_Number  Count             Project_Number   Count
S100            2                 S100             1
S200            1                 S200             2
S300            3                 S300             3
S400            2

Now take the difference between above 2 dataframes and print the result-

df_difference-

Project_Number  
S100           
S200
S400  
Shivika
  • 209
  • 3
  • 15

2 Answers2

1

To filter, try using isin

df2_new = df2[df2["Project_Number"].isin(df1["Project_Number"].unique().tolist())]
Raymond Toh
  • 779
  • 1
  • 8
  • 27
1

Use Series.isin for filtering, then compare Series from Series.value_counts for not equal and convert sorted index to one column DataFrame:

df2_new = df2.loc[df2["Project_Number"].isin(df1["Project_Number"]),'Project_Number']

s1 = df1['Project_Number'].value_counts()
s2 = df2_new.value_counts()

df_difference = s1[s1.ne(s2)].sort_index().index.to_frame(name='Project_Number',index=False)
print (df_difference)
  Project_Number
0           S100
1           S200
2           S400
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252