I have two DataFrames. df1 =

columnA columnB columnC columnD
value1  value7  value13 value20
value2  value8  value14 value21
value3  value9  value15 value22
value4  value10 value16 value23
value5  value11 value17 value24
value6  null    null    value25

df2=

columnA columnB columnC columnD
value1  value7  value13 value20
value2  null    value14 value21
null    value9  value15 value22
value4  value10 value16 value23
value5  value11 value17 value24
value6  value12 value18 value25

I want to compare the two DataFrames and pick every row that has a null (missing) value in either of them, with each missing value filled in from the other DataFrame. My output DataFrame should look like this. outputDF =

columnA columnB columnC columnD
value2  value8  value14 value21
value3  value9  value15 value22
value6  value12 value18 value25

How can I achieve this using PySpark? The column names are generic, so they may differ from the ones shown above. How can I write generic code that fetches the rows with missing values from both DataFrames?

karthik
  • The pyspark tag is potentially misleading, as I believe this is a pandas question. If so, I would change the tag to pandas, and you will probably get a reply more quickly. I'm on my phone, so I cannot easily give you a solution. – John M. Jan 16 '23 at 11:52
  • @john-m I have now changed the tag to pandas. – karthik Jan 16 '23 at 13:04

1 Answer

1

IIUC, use this if both the index and the column names are the same in the two DataFrames:

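import numpy as np

# Convert the 'null' placeholder strings to real missing values
df1 = df1.replace('null', np.nan)
df2 = df2.replace('null', np.nan)

# Rows with at least one missing value in either DataFrame
mask = df1.isna().any(axis=1) | df2.isna().any(axis=1)

# Fill the gaps in df1 from df2, then keep only the flagged rows
df = df1.combine_first(df2)[mask]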
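
For reference, here is a minimal, self-contained sketch of the same idea run on the sample data from the question. It assumes pandas (not PySpark) DataFrames, that both frames share the same default index and column names, and that missing cells are stored as the literal string 'null'. The names `cols` and `result` are only illustrative.

```python
import numpy as np
import pandas as pd

cols = ["columnA", "columnB", "columnC", "columnD"]

# Sample frames from the question; 'null' is assumed to be a plain string.
df1 = pd.DataFrame(
    [
        ["value1", "value7", "value13", "value20"],
        ["value2", "value8", "value14", "value21"],
        ["value3", "value9", "value15", "value22"],
        ["value4", "value10", "value16", "value23"],
        ["value5", "value11", "value17", "value24"],
        ["value6", "null", "null", "value25"],
    ],
    columns=cols,
)
df2 = pd.DataFrame(
    [
        ["value1", "value7", "value13", "value20"],
        ["value2", "null", "value14", "value21"],
        ["null", "value9", "value15", "value22"],
        ["value4", "value10", "value16", "value23"],
        ["value5", "value11", "value17", "value24"],
        ["value6", "value12", "value18", "value25"],
    ],
    columns=cols,
)

# Convert the placeholder strings to real missing values.
df1 = df1.replace("null", np.nan)
df2 = df2.replace("null", np.nan)

# True for every row that has a missing cell in either frame.
mask = df1.isna().any(axis=1) | df2.isna().any(axis=1)

# Fill df1's missing cells from df2, then keep only the flagged rows.
result = df1.combine_first(df2)[mask]
print(result)
```

Because the mask and `combine_first` both align on the index, this only makes sense when row i of `df1` and row i of `df2` describe the same record.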
jezrael
  • I'm getting this error: "ValueError: Mixed type replacements are not supported" – karthik Jan 16 '23 at 17:41
  • @karthik - I think you need [this](https://stackoverflow.com/a/52614996/2901002) – jezrael Jan 17 '23 at 06:39
  • Thanks for your comment, it helped me a lot. I just wanted to know: if the missing values are marked by some value other than null, how can I pick those rows? – karthik Jan 17 '23 at 09:28
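
One possible way to handle the last follow-up, sketched here with pandas only: treat any chosen placeholder values as missing before building the mask. The helper name `rows_with_marker`, its `markers` parameter, and the small `left`/`right` frames are hypothetical and not part of the original answer; the sketch again assumes both frames share the same index and column names.

```python
import pandas as pd


def rows_with_marker(df1, df2, markers):
    """Return combined rows where either frame holds one of `markers`.

    Hypothetical helper: `markers` is a list of placeholder values that
    should count as missing (cells that are already NaN count as well).
    """
    # isin() flags cells equal to any marker; mask() turns them into NaN.
    clean1 = df1.mask(df1.isin(markers))
    clean2 = df2.mask(df2.isin(markers))
    # Rows with at least one missing cell in either frame.
    flagged = clean1.isna().any(axis=1) | clean2.isna().any(axis=1)
    # Fill gaps in the first frame from the second, keep flagged rows only.
    return clean1.combine_first(clean2)[flagged]


# Made-up illustration data using "N/A" and "-" as placeholders.
left = pd.DataFrame({"columnA": ["a", "N/A", "c"], "columnB": ["x", "y", "-"]})
right = pd.DataFrame({"columnA": ["a", "b", "c"], "columnB": ["x", "y", "z"]})

picked = rows_with_marker(left, right, ["N/A", "-"])
print(picked)
```

Nothing here depends on specific column names, so the same call should work for frames whose columns differ from the ones in the question, as long as the two frames match each other.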