How to compare three columns of a csv file and determine the missing data with python?

Question

I have a CSV file (screenshot attached) dataset

There I have three different columns and the first one, "Set of Ids known for transformers" is the master column. I need to compare the other two columns with the master column and filter out the missing values in the remaining two columns.

Can anyone please tell me how to do this thing in python with the pandas library?

Thanks in advance.

Welcome to StackOverflow. Please take the time to read this post on [how to provide a great pandas example](http://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) as well as how to provide a [minimal, complete, and verifiable example](http://stackoverflow.com/help/mcve) and revise your question accordingly. These tips on [how to ask a good question](http://stackoverflow.com/help/how-to-ask) may also be useful. — jezrael, Dec 06 '19 at 06:00
[Please don't post images of code/data (or links to them)](http://meta.stackoverflow.com/questions/285551/why-may-i-not-upload-images-of-code-on-so-when-asking-a-question) — jezrael, Dec 06 '19 at 06:00

jeremy_rutman · Answer 1 · 2019-12-06T10:20:32.807

You can remove any row missing both nonmaster columns using a condition to select rows:

import pandas as pd
df = pd.read_excel('soxl.xlsx')
df=df[df['IDs of phase known'].notnull() | df['Ids of distance known'].notnull()]

If you wanted to remove any row with any missing data you can use the builtin :

df = df.dropna()

which removes any row with missing values (which otherwise get imported as Nan values).

If you want to replace the 2nd col. with master column values when they are missing, you can do

df = df.where((pd.notnull(df)), None)

to replace the NaN's with None (useful in the next step) and then

df['IDs of phase known']= df.apply(lambda r:r['IDs of phase known'] if r['IDs of phase known'] else r['Set of Ids known for transformers'],axis=1)

And of course you can do the same with the 3rd col. If you want to replace only in the case where both col2 and 3 values are missing, you can do similarly, but check both columns for None:

df['IDs of phase known']= df.apply(lambda r:r['Set of Ids known for transformers'] if not (r['IDs of phase known'] or r['Ids of distance known']) else r['IDs of phase known'] ,axis=1 )

You can also do this, actually a bit easier, with numpy:

df['IDs of phase known'] = np.where(np.isnan(df['IDs of phase known']), df['Set of Ids known for transformers'], df['IDs of phase known'])

If you want to just replace missing values with master column vals, or

df['IDs of phase known'] = np.where( (np.isnan(df['IDs of phase known']) & (np.isnan(df['Ids of distance known']))),df['Set of Ids known for transformers'],df['IDs of phase known'])

If you only want to replace in cases where both cols are missing.

Hi @ jeremy_rutman, thank you for your answer. But this is not what I am looking for. Actually what is want is to compare the other two columns with the master column and find the values which are missing in those two columns. — gayan_lanke, Dec 06 '19 at 06:27
so you want to drop any row that's only missing values in both the two other columns? — jeremy_rutman, Dec 06 '19 at 06:31
What I want is to compare the first column with the other two columns and output the values which are missing for the last two columns separately. — gayan_lanke, Dec 06 '19 at 07:25
What values do you want to output in the case of missing fields, you want to fill in the value from the master column? — jeremy_rutman, Dec 06 '19 at 09:16

How to compare three columns of a csv file and determine the missing data with python?

1 Answers1