What is an efficient way to remove duplicated rows from a pandas dataframe, where I always want to keep the first value that is not NaN?
Example:
import pandas as pd
import numpy as np
data = pd.DataFrame({'a': [np.nan, np.nan, 2, 2, 3, 3, 5],
                     'b': [2, 1, 1, 1, np.nan, 2, 1]},
                    index=[pd.Timestamp('2018-03-01'), pd.Timestamp('2018-03-02'),
                           pd.Timestamp('2018-03-02'), pd.Timestamp('2018-03-02'),
                           pd.Timestamp('2018-03-03'), pd.Timestamp('2018-03-03'),
                           pd.Timestamp('2018-03-04')])
print(data)
>               a    b
> 2018-03-01  NaN  2.0
> 2018-03-02  NaN  1.0  # take 'a' from the next row, 'b' from this row
> 2018-03-02  2.0  1.0
> 2018-03-02  2.0  1.0
> 2018-03-03  3.0  NaN  # take 'a' from this row but 'b' from the next row
> 2018-03-03  3.0  2.0
> 2018-03-04  5.0  1.0
# Is there something faster?
x = data.groupby(data.index).first()
print(x)
Should give:
>               a    b
> 2018-03-01  NaN  2.0
> 2018-03-02  2.0  1.0
> 2018-03-03  3.0  2.0
> 2018-03-04  5.0  1.0
data.groupby(data.index).first() does the job, but it is ridiculously slow.
For a dataframe of shape (5'730'238, 7) it took 40 minutes to remove the duplicates; for another table of shape (1'191'704, 339) it took 5 hours 20 minutes (datetime index, all columns integer/float).
Note that the data might contain only a few duplicated rows.
In another question, the suggested approach is data[~data.index.duplicated(keep='first')], but this does not handle NaNs in the desired way.
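If I understand index.duplicated correctly, that mask keeps entire first rows rather than the first non-NaN value per column, so on the example data above it would give something like:
dedup = data[~data.index.duplicated(keep='first')]
print(dedup)
>               a    b
> 2018-03-01  NaN  2.0
> 2018-03-02  NaN  1.0  # 'a' stays NaN instead of 2.0
> 2018-03-03  3.0  NaN  # 'b' stays NaN instead of 2.0
> 2018-03-04  5.0  1.0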
It doesn't really matter whether I choose first, last, mean, or anything else, as long as it is fast.
Is there a faster way than groupby, or is there a problem with my data that is making it slow?
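One direction I have considered (just a rough sketch, not verified to be faster; the variable names are my own) is to run the expensive groupby only on the rows whose index actually repeats, since the data contains only a few duplicates, and then concatenate with the untouched unique rows:
# mark every row whose index value occurs more than once
dup_any = data.index.duplicated(keep=False)
# rows with a unique index need no aggregation at all
unique_rows = data[~dup_any]
# first() skips NaN per column, applied only to the duplicated part
collapsed = data[dup_any].groupby(level=0).first()
result = pd.concat([unique_rows, collapsed]).sort_index()
I don't know whether pd.concat plus sort_index eats up the savings on dataframes of this size, which is part of what I'm asking.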