How can I overwrite a pandas dataframe with another one in the most pythonic and fastest way?
Let me elaborate on my use case and what all methods I have tried so far and the results which I got with each one of them.
I have 2 pandas dataframes, let me call them bigdf
and smalldf
. I want to replace the rows, columns of bigdf
with the values from smalldf
.
bigdf
A B C D E F G H I J K L M N
0 2021-02-12 80 1 3000 100 3100 2021-02-12 05:00:00 0 2021-02-12 05:01:00 1.0 60.0 1.0 2021-02-12 05:03:00 -6.912197
1 2021-02-12 80 1 4000 100 4100 2021-02-12 05:05:00 1000 NaT 0.0 NaN NaN 2021-02-12 05:05:40 -6.658210
2 2021-02-12 80 1 5000 150 5100 2021-02-12 05:10:00 1200 NaT NaN NaN NaN NaT NaN
3 2021-02-12 80 1 6000 150 6100 2021-02-12 05:15:00 1500 NaT NaN NaN NaN NaT NaN
4 2021-02-12 40 1 7000 300 7100 2021-02-12 05:10:00 700 NaT NaN NaN NaN NaT NaN
5 2021-02-12 40 1 8000 300 8100 2021-02-12 05:05:00 980 NaT NaN NaN NaN NaT NaN
6 2021-02-12 60 1 9000 400 9100 2021-02-12 05:15:00 1300 NaT NaN NaN NaN NaT NaN
smalldf
A B C E F I J M N
0 2021-02-12 80 1 100 3100 NaT NaN NaT -6.912197
1 2021-02-12 80 1 100 4100 2021-02-12 05:04:30 1.0 2021-02-12 05:05:59 -6.658210
2 2021-02-12 80 1 150 5100 2021-02-12 05:11:30 1.0 2021-02-12 05:11:00 53.308885
The matching columns in both dataframes are A
, B
, E
and F
.
The result which I am expecting after overwritting smalldf values into bigdf is:
Resultant bigdf
A B C D E F G H I J K L M N
0 2021-02-12 80 1 3000 100 3100 2021-02-12 05:00:00 0 NaT NaN 60.0 1.0 NaT -6.912197
1 2021-02-12 80 1 4000 100 4100 2021-02-12 05:05:00 1000 2021-02-12 05:04:30 1.0 NaN NaN 2021-02-12 05:05:59 -6.658210
2 2021-02-12 80 1 5000 150 5100 2021-02-12 05:10:00 1200 2021-02-12 05:11:30 1.0 NaN NaN 2021-02-12 05:11:00 53.308885
3 2021-02-12 80 1 6000 150 6100 2021-02-12 05:15:00 1500 NaT NaN NaN NaN NaT NaN
4 2021-02-12 40 1 7000 300 7100 2021-02-12 05:10:00 700 NaT NaN NaN NaN NaT NaN
5 2021-02-12 40 1 8000 300 8100 2021-02-12 05:05:00 980 NaT NaN NaN NaN NaT NaN
6 2021-02-12 60 1 9000 400 9100 2021-02-12 05:15:00 1300 NaT NaN NaN NaN NaT NaN
As is evident from the resultant bigdf, the values where the rows and columns matched with smalldf are all overwritten by corresponding values from smalldf covering all of the below scenarios:
- Value overwritten by another value.
- NaN/NaT overwritten by value.
- Value overwritten by NaN/NaT.
To get this result I have used the following code:
smalldf = smalldf.set_index(['A', 'B', 'E', 'F'])
bigdf = bigdf.set_index(['A', 'B', 'E', 'F'])
var1 = bigdf['I']
var2 = bigdf['J']
var3 = bigdf['M']
var4 = bigdf['N']
for index, row in smalldf.iterrows():
if index in bigdf.index:
var1[index] = row['I']
var2[index] = row['J']
var3[index] = row['M']
var4[index] = row['N']
bigdf['I'] = var1
bigdf['J'] = var2
bigdf['M'] = var3
bigdf['N'] = var4
bigdf = bigdf.reset_index()
But the problem is that this code as much as it gives correct results is very slow. For a bigdf with about 8K rows and 50 columns it takes about 30 seconds.
For faster output, I have tried combine_first
but this function only overwrites Null values.
Next I tried bigdf.update(smalldf)
but this one does not overwrites a value with a Null (if the case be).
Can anyone suggest a faster and accurate method to achieve the result I am looking at? I am looking at a solution which would take not more than a couple of seconds to overwrite the dataframe of the order of approx (10K, 50). The smalldf will be approx (2K, 30).
My emphasis here is the execution speed. Any help would be appreciated.