I am running an experiment to observe the impact of missing values on query results, using Python pandas. Suppose I have a dataframe df that holds the complete data. My real data has many columns and thousands of rows.
I made a copy of df called df_copy. I run the experiment on df_copy, with df serving as the ground truth. I then put NaN values into df_copy at random positions.
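For context, the random injection step could look roughly like the sketch below. The small frame, the seed and missing_rate are illustrative assumptions, not my real setup.

```python
import numpy as np
import pandas as pd

# Hypothetical ground-truth frame standing in for the real data.
df = pd.DataFrame([["x", 2, 3], ["y", 5, 6], ["z", 8, 9]],
                  columns=["a", "b", "c"])

rng = np.random.default_rng(seed=0)  # fixed seed so the run is repeatable
missing_rate = 0.3                   # assumed fraction of cells to blank out

# Boolean array with the same shape as df; True marks a cell to blank out.
mask = rng.random(df.shape) < missing_rate

# DataFrame.mask sets the cells where the condition is True to NaN and
# returns a new frame, so df itself is left untouched.
df_copy = df.mask(mask)
```

Keeping mask around also records exactly which cells were blanked, which is handy when scoring a repair heuristic later.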
I have some ideas for fixing the missing values in df_copy using heuristics. Currently I can do this easily with row operations in pandas. For instance, if I want to fix a row in df_copy, I get the row by its id, drop it, and replace it with the corresponding row from df.
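A rough sketch of that row-level repair, assuming both frames share the same index and columns. row_id is a hypothetical label chosen for illustration, and the row is overwritten in place with .loc rather than dropped and re-added.

```python
import numpy as np
import pandas as pd

# Small stand-ins for the ground truth and the damaged copy.
df = pd.DataFrame([["x", 2, 3], ["y", 5, 6], ["z", 8, 9]],
                  columns=["a", "b", "c"])
df_copy = pd.DataFrame([["x", np.nan, 3], ["y", 5, np.nan], [np.nan, 8, 9]],
                       columns=["a", "b", "c"])

row_id = 1  # hypothetical index label of the row to repair

# Overwrite the whole row with the ground-truth row; values are aligned
# on the column names.
df_copy.loc[row_id] = df.loc[row_id]
```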
My question is: how can I do a cell-based operation in pandas? For instance, how can I get the index (x, y) of every missing value, so that when I want to fix a missing cell I can replace its value with the one from the ground truth by using that index (x, y)?
Example:
df
import numpy as np
import pandas as pd
df = pd.DataFrame([["x", 2, 3], ["y", 5, 6], ["z", 8, 9]],
                  columns=['a', 'b', 'c'])
   a  b  c
0  x  2  3
1  y  5  6
2  z  8  9
df_copy
# Plain lists keep np.nan as a real missing value; np.array would turn it into the string 'nan'
df_copy = pd.DataFrame([["x", np.nan, 3], ["y", 5, np.nan], [np.nan, 8, 9]],
                       columns=['a', 'b', 'c'])
     a    b    c
0    x  NaN  3.0
1    y  5.0  NaN
2  NaN  8.0  9.0
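To make the goal concrete, here is a minimal sketch of the kind of cell-level access I have in mind. It assumes the frames are built from plain lists so that np.nan is a real missing value, and that df and df_copy share the same index and columns. The names missing, positions, labels and df_fixed are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([["x", 2, 3], ["y", 5, 6], ["z", 8, 9]],
                  columns=["a", "b", "c"])
df_copy = pd.DataFrame([["x", np.nan, 3], ["y", 5, np.nan], [np.nan, 8, 9]],
                       columns=["a", "b", "c"])

# Boolean frame: True wherever df_copy has a missing value.
missing = df_copy.isna()

# Positional (row, column) pairs of every missing cell, in row-major order.
positions = [(int(r), int(c)) for r, c in np.argwhere(missing.to_numpy())]

# The same cells expressed as (index label, column name) pairs.
labels = [(df_copy.index[r], df_copy.columns[c]) for r, c in positions]

# Repair a single cell by label; .at reads and writes one scalar.
row, col = labels[0]
df_copy.at[row, col] = df.at[row, col]

# Or fill every remaining missing cell from the ground truth in one call.
df_fixed = df_copy.fillna(df)
```

For purely positional access, .iat takes the integer pairs from positions instead of labels. Note that columns which held NaN stay float, so a repaired 2 comes back as 2.0.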