Concat 2 dataframes and drop duplicates + consider the date and time to replace the NaN values by data from one of the dataframes

Question

Good morning,

I'm working on this code :

df1 = pd.DataFrame()
df1['Date'] = ["29/07/2021", "29/07/2021", "29/07/2021", "29/07/2021", "30/07/2021", "30/07/2021", "30/07/2021", "30/07/2021", "31/07/2021", "31/07/2021", "01/08/2021", "01/08/2021", "02/08/2021", "02/08/2021"]
df1['Time'] = ["06:48:00", "06:48:00", "06:56:00", "06:56:00", "07:14:00", "07:14:00", "07:40:00", "07:40:00", "08:42:00", "08:42:00", "08:52:00", "08:52:00", "09:07:00", "09:07:00"]
df1["Column1"] = ['NaN', 'NaN', 0.038807581, 0.018807581, 0.025931434, 0.025163517, 0.026561283, 0.027743659, 0.028854, 0.000383506, 0.000543031, 0.000342, 'NaN', 'NaN']
df1["Column2"] = [0.000270475, 0.000313769,  'NaN', 'NaN', 0.000483506, 0.000643031,  0.000533131,  0.000543031,  0.000342, 0.056263517, 0.042163517, 0.035163517, 0.025163517, 0.026363517]
df2 = pd.DataFrame()
df2['Date'] = ["29/07/2021", "29/07/2021", "29/07/2021", "29/07/2021", "30/07/2021", "30/07/2021", "30/07/2021", "30/07/2021", "31/07/2021", "31/07/2021", "01/08/2021", "01/08/2021", "02/08/2021", "02/08/2021"]
df2['Time'] = ["06:48:00", "06:48:00", "06:56:00", "06:56:00", "07:14:00", "07:14:00", "07:40:00", "07:40:00", "08:42:00", "08:42:00", "08:52:00", "08:52:00", "09:07:00", "09:07:00"]
df2["Column1"] = [0.041807581, 0.019607581, 'NaN', 'NaN', 0.025931434, 0.025163517, 0.026561283, 0.027743659, 0.028854, 0.000383506, 'NaN', 'NaN', 0.000313769, 0.000413769]
df2["Column2"] = [0.000270475, 0.000313769,  0.000383506,  0.000583506, 'NaN', 'NaN',  0.000533131,  0.000543031, 'NaN', 'NaN', 0.042163517, 0.035163517, 0.025163517, 0.026363517]
diff_df = pd.concat([df1, df2]).drop_duplicates().reset_index(drop=True)

The output is like this :

+------------+----------+-------------+-------------+
|    Date    |   Time   |   Column1   |   Column2   |
+------------+----------+-------------+-------------+
| 29/07/2021 | 06:48:00 | NaN         | 0,000270475 |
| 29/07/2021 | 06:48:00 | NaN         | 0,000313769 |
| 29/07/2021 | 06:56:00 | 0,038807581 | NaN         |
| 29/07/2021 | 06:56:00 | 0,018807581 | NaN         |
| 30/07/2021 | 07:14:00 | 0,025931434 | 0,000483506 |
| 30/07/2021 | 07:14:00 | 0,025163517 | 0,000643031 |
| 30/07/2021 | 07:40:00 | 0,026561283 | 0,000533131 |
| 30/07/2021 | 07:40:00 | 0,027743659 | 0,000543031 |
| 31/07/2021 | 08:42:00 | 0,028854    | 0,000342    |
| 31/07/2021 | 08:42:00 | 0,000383506 | 0,056263517 |
| 01/08/2021 | 08:52:00 | 0,000543031 | 0,042163517 |
| 01/08/2021 | 08:52:00 | 0,000342    | 0,035163517 |
| 02/08/2021 | 09:07:00 | NaN         | 0,025163517 |
| 02/08/2021 | 09:07:00 | NaN         | 0,026363517 |
| 29/07/2021 | 06:48:00 | 0,041807581 | 0,000270475 |
| 29/07/2021 | 06:48:00 | 0,019607581 | 0,000313769 |
| 29/07/2021 | 06:56:00 | NaN         | 0,000383506 |
| 29/07/2021 | 06:56:00 | NaN         | 0,000583506 |
| 30/07/2021 | 07:14:00 | 0,025931434 | NaN         |
| 30/07/2021 | 07:14:00 | 0,025163517 | NaN         |
| 31/07/2021 | 08:42:00 | 0,028854    | NaN         |
| 31/07/2021 | 08:42:00 | 0,000383506 | NaN         |
| 01/08/2021 | 08:52:00 | NaN         | 0,042163517 |
| 01/08/2021 | 08:52:00 | NaN         | 0,035163517 |
| 02/08/2021 | 09:07:00 | 0,000313769 | 0,025163517 |
| 02/08/2021 | 09:07:00 | 0,000413769 | 0,026363517 |
+------------+----------+-------------+-------------+

What i need is that if there is NaN value for one of the dataframes, it will consider to use the other one value considering the fact that they have the same date and time. which is something like this :

+------------+----------+-------------+-------------+
|    Date    |   Time   |   Column1   |   Column2   |
+------------+----------+-------------+-------------+
| 29/07/2021 | 06:48:00 | 0,041807581 | 0,000270475 |
| 29/07/2021 | 06:48:00 | 0,019607581 | 0,000313769 |
| 29/07/2021 | 06:56:00 | 0,038807581 | 0,000383506 |
| 29/07/2021 | 06:56:00 | 0,018807581 | 0,000583506 |
| 30/07/2021 | 07:14:00 | 0,025931434 | 0,000483506 |
| 30/07/2021 | 07:14:00 | 0,025163517 | 0,000643031 |
| 30/07/2021 | 07:40:00 | 0,026561283 | 0,000533131 |
| 30/07/2021 | 07:40:00 | 0,027743659 | 0,000543031 |
| 31/07/2021 | 08:42:00 | 0,028854    | 0,000342    |
| 31/07/2021 | 08:42:00 | 0,000383506 | 0,056263517 |
| 01/08/2021 | 08:52:00 | 0,000543031 | 0,042163517 |
| 01/08/2021 | 08:52:00 | 0,000342    | 0,035163517 |
| 02/08/2021 | 09:07:00 | 0,000313769 | 0,025163517 |
| 02/08/2021 | 09:07:00 | 0,000413769 | 0,026363517 |
+------------+----------+-------------+-------------+

Thank you for your time and have a great day !

EDIT######## Using these lines :

diff_df.Column1 = diff_df.Column1.fillna(diff_df.Column2)
diff_df.Column2 = diff_df.Column2.fillna(diff_df.Column1)

Will give me this output which is not what i needed :

+------------+----------+-------------+-------------+
|    Date    |   Time   |   Column1   |   Column2   |
+------------+----------+-------------+-------------+
| 29/07/2021 | 06:48:00 | 0,000270475 | 0,000270475 |
| 29/07/2021 | 06:48:00 | 0,000313769 | 0,000313769 |
| 29/07/2021 | 06:56:00 | 0,038807581 | 0,038807581 |
| 29/07/2021 | 06:56:00 | 0,018807581 | 0,018807581 |
| 30/07/2021 | 07:14:00 | 0,025931434 | 0,000483506 |
| 30/07/2021 | 07:14:00 | 0,025163517 | 0,000643031 |
| 30/07/2021 | 07:40:00 | 0,026561283 | 0,000533131 |
| 30/07/2021 | 07:40:00 | 0,027743659 | 0,000543031 |
| 31/07/2021 | 08:42:00 | 0,028854    | 0,000342    |
| 31/07/2021 | 08:42:00 | 0,000383506 | 0,056263517 |
| 01/08/2021 | 08:52:00 | 0,000543031 | 0,042163517 |
| 01/08/2021 | 08:52:00 | 0,000342    | 0,035163517 |
| 02/08/2021 | 09:07:00 | 0,025163517 | 0,025163517 |
| 02/08/2021 | 09:07:00 | 0,026363517 | 0,026363517 |
| 29/07/2021 | 06:48:00 | 0,041807581 | 0,000270475 |
| 29/07/2021 | 06:48:00 | 0,019607581 | 0,000313769 |
| 29/07/2021 | 06:56:00 | 0,000383506 | 0,000383506 |
| 29/07/2021 | 06:56:00 | 0,000583506 | 0,000583506 |
| 30/07/2021 | 07:14:00 | 0,025931434 | 0,025931434 |
| 30/07/2021 | 07:14:00 | 0,025163517 | 0,025163517 |
| 31/07/2021 | 08:42:00 | 0,028854    | 0,028854    |
| 31/07/2021 | 08:42:00 | 0,000383506 | 0,000383506 |
| 01/08/2021 | 08:52:00 | 0,042163517 | 0,042163517 |
| 01/08/2021 | 08:52:00 | 0,035163517 | 0,035163517 |
| 02/08/2021 | 09:07:00 | 0,000313769 | 0,025163517 |
| 02/08/2021 | 09:07:00 | 0,000413769 | 0,026363517 |
+------------+----------+-------------+-------------+

Do you need `df.Column1 = df.Column1.fillna(df.Column2)` and `df.Column2 = df.Column2.fillna(df.Column1)` ? — jezrael, Apr 19 '22 at 11:45
No i don't think so, i will show you the result that it gives me with an update on my post — HamidRouji, Apr 19 '22 at 12:02
@jezrael updated, you can check before the edit what i wanted as output, please. — HamidRouji, Apr 19 '22 at 12:06

jezrael · Accepted Answer · 2022-04-19T12:12:02.257

Use DataFrame.fillna with convert values to index if possible different datetimes:

df1 = df1.replace('NaN', np.nan)
df2 = df2.replace('NaN', np.nan)

df = df1.set_index(['Date','Time']).fillna(df2.set_index(['Date','Time'])).reset_index()
print (df)

          Date      Time   Column1   Column2
0   29/07/2021  06:48:00  0.041808  0.000270
1   29/07/2021  06:48:00  0.019608  0.000314
2   29/07/2021  06:56:00  0.038808  0.000384
3   29/07/2021  06:56:00  0.018808  0.000584
4   30/07/2021  07:14:00  0.025931  0.000484
5   30/07/2021  07:14:00  0.025164  0.000643
6   30/07/2021  07:40:00  0.026561  0.000533
7   30/07/2021  07:40:00  0.027744  0.000543
8   31/07/2021  08:42:00  0.028854  0.000342
9   31/07/2021  08:42:00  0.000384  0.056264
10  01/08/2021  08:52:00  0.000543  0.042164
11  01/08/2021  08:52:00  0.000342  0.035164
12  02/08/2021  09:07:00  0.000314  0.025164
13  02/08/2021  09:07:00  0.000414  0.026364

If always same index and same rows between both DataFrames:

df1 = df1.replace('NaN', np.nan)
df2 = df2.replace('NaN', np.nan)

df = df1.fillna(df2)
print (df)
          Date      Time   Column1   Column2
0   29/07/2021  06:48:00  0.041808  0.000270
1   29/07/2021  06:48:00  0.019608  0.000314
2   29/07/2021  06:56:00  0.038808  0.000384
3   29/07/2021  06:56:00  0.018808  0.000584
4   30/07/2021  07:14:00  0.025931  0.000484
5   30/07/2021  07:14:00  0.025164  0.000643
6   30/07/2021  07:40:00  0.026561  0.000533
7   30/07/2021  07:40:00  0.027744  0.000543
8   31/07/2021  08:42:00  0.028854  0.000342
9   31/07/2021  08:42:00  0.000384  0.056264
10  01/08/2021  08:52:00  0.000543  0.042164
11  01/08/2021  08:52:00  0.000342  0.035164
12  02/08/2021  09:07:00  0.000314  0.025164
13  02/08/2021  09:07:00  0.000414  0.026364

Working perfectly ! Thank you very much – HamidRouji Apr 19 '22 at 12:10 — HamidRouji, Apr 19 '22 at 12:10

Concat 2 dataframes and drop duplicates + consider the date and time to replace the NaN values by data from one of the dataframes

1 Answers1