0

How can I overwrite a pandas dataframe with another one in the most pythonic and fastest way?

Let me elaborate on my use case and what all methods I have tried so far and the results which I got with each one of them.

I have 2 pandas dataframes, let me call them bigdf and smalldf. I want to replace the rows, columns of bigdf with the values from smalldf.

bigdf

           A   B  C     D    E     F                   G     H                   I    J     K    L                   M         N
0 2021-02-12  80  1  3000  100  3100 2021-02-12 05:00:00     0 2021-02-12 05:01:00  1.0  60.0  1.0 2021-02-12 05:03:00 -6.912197
1 2021-02-12  80  1  4000  100  4100 2021-02-12 05:05:00  1000                 NaT  0.0   NaN  NaN 2021-02-12 05:05:40 -6.658210
2 2021-02-12  80  1  5000  150  5100 2021-02-12 05:10:00  1200                 NaT  NaN   NaN  NaN                 NaT       NaN
3 2021-02-12  80  1  6000  150  6100 2021-02-12 05:15:00  1500                 NaT  NaN   NaN  NaN                 NaT       NaN
4 2021-02-12  40  1  7000  300  7100 2021-02-12 05:10:00   700                 NaT  NaN   NaN  NaN                 NaT       NaN
5 2021-02-12  40  1  8000  300  8100 2021-02-12 05:05:00   980                 NaT  NaN   NaN  NaN                 NaT       NaN
6 2021-02-12  60  1  9000  400  9100 2021-02-12 05:15:00  1300                 NaT  NaN   NaN  NaN                 NaT       NaN

smalldf

           A   B  C    E     F                   I    J                   M          N
0 2021-02-12  80  1  100  3100                 NaT  NaN                 NaT  -6.912197
1 2021-02-12  80  1  100  4100 2021-02-12 05:04:30  1.0 2021-02-12 05:05:59  -6.658210
2 2021-02-12  80  1  150  5100 2021-02-12 05:11:30  1.0 2021-02-12 05:11:00  53.308885

The matching columns in both dataframes are A, B, Eand F.

The result which I am expecting after overwritting smalldf values into bigdf is:

Resultant bigdf

           A   B  C     D    E     F                   G     H                   I    J     K    L                   M          N
0 2021-02-12  80  1  3000  100  3100 2021-02-12 05:00:00     0                 NaT  NaN  60.0  1.0                 NaT  -6.912197
1 2021-02-12  80  1  4000  100  4100 2021-02-12 05:05:00  1000 2021-02-12 05:04:30  1.0   NaN  NaN 2021-02-12 05:05:59  -6.658210
2 2021-02-12  80  1  5000  150  5100 2021-02-12 05:10:00  1200 2021-02-12 05:11:30  1.0   NaN  NaN 2021-02-12 05:11:00  53.308885
3 2021-02-12  80  1  6000  150  6100 2021-02-12 05:15:00  1500                 NaT  NaN   NaN  NaN                 NaT        NaN
4 2021-02-12  40  1  7000  300  7100 2021-02-12 05:10:00   700                 NaT  NaN   NaN  NaN                 NaT        NaN
5 2021-02-12  40  1  8000  300  8100 2021-02-12 05:05:00   980                 NaT  NaN   NaN  NaN                 NaT        NaN
6 2021-02-12  60  1  9000  400  9100 2021-02-12 05:15:00  1300                 NaT  NaN   NaN  NaN                 NaT        NaN

As is evident from the resultant bigdf, the values where the rows and columns matched with smalldf are all overwritten by corresponding values from smalldf covering all of the below scenarios:

  1. Value overwritten by another value.
  2. NaN/NaT overwritten by value.
  3. Value overwritten by NaN/NaT.

To get this result I have used the following code:

smalldf = smalldf.set_index(['A', 'B', 'E', 'F'])
bigdf = bigdf.set_index(['A', 'B', 'E', 'F'])

var1 = bigdf['I']
var2 = bigdf['J']
var3 = bigdf['M']
var4 = bigdf['N']

for index, row in smalldf.iterrows():
   if index in bigdf.index:
      var1[index] = row['I']
      var2[index] = row['J']
      var3[index] = row['M']
      var4[index] = row['N']

bigdf['I'] = var1
bigdf['J'] = var2
bigdf['M'] = var3
bigdf['N'] = var4

bigdf = bigdf.reset_index()

But the problem is that this code as much as it gives correct results is very slow. For a bigdf with about 8K rows and 50 columns it takes about 30 seconds.

For faster output, I have tried combine_first but this function only overwrites Null values. Next I tried bigdf.update(smalldf) but this one does not overwrites a value with a Null (if the case be).

Can anyone suggest a faster and accurate method to achieve the result I am looking at? I am looking at a solution which would take not more than a couple of seconds to overwrite the dataframe of the order of approx (10K, 50). The smalldf will be approx (2K, 30).

My emphasis here is the execution speed. Any help would be appreciated.

Prachi
  • 494
  • 3
  • 8
  • 21
  • why not just reassign the values like so bigdf[column]= smaldf[column]. – Ade_1 Feb 20 '21 at 18:16
  • Does this answer your question? [Python pandas - particular merge/replacement](https://stackoverflow.com/questions/25762902/python-pandas-particular-merge-replacement) – Yulian Feb 20 '21 at 18:26
  • `Ade_1` This won't work as the number of rows in both DFS are not same. – Prachi Feb 20 '21 at 18:32
  • `Moti` I checked the post that you have suggested but nothing came out of it in terms of faster execution speed. – Prachi Feb 20 '21 at 18:33
  • I believe this is the same question: https://stackoverflow.com/questions/52728506/update-a-dataframe-by-dataframes-with-nan-values – bicarlsen Feb 20 '21 at 19:54
  • `Bicarlsen` The post which you have posted clearly states that if the number of rows are less in the df whose values are to be used for overwritting then for the df which is to be overwritten all such row values will become NaN. So esentially I am not looking at such a solution. – Prachi Feb 21 '21 at 10:42

0 Answers0