
What is an efficient way to remove duplicated rows from a pandas DataFrame, where I would always like to keep the first value that is not NaN?

Example:

import pandas as pd
import numpy as np
data = pd.DataFrame({'a': [np.nan,np.nan,2,2,3,3,5],
                     'b': [2,1,1,1,np.nan,2,1]},
                     index=[pd.Timestamp('2018-03-01'), pd.Timestamp('2018-03-02'),
                            pd.Timestamp('2018-03-02'), pd.Timestamp('2018-03-02'),
                            pd.Timestamp('2018-03-03'), pd.Timestamp('2018-03-03'),
                            pd.Timestamp('2018-03-04')])

print(data)
>              a    b
> 2018-03-01  NaN  2.0
> 2018-03-02  NaN  1.0  # take 'a' from next row, 'b' from this row
> 2018-03-02  2.0  1.0
> 2018-03-02  2.0  1.0
> 2018-03-03  3.0  NaN  # take 'a' from this row but 'b' from next row
> 2018-03-03  3.0  2.0
> 2018-03-04  5.0  1.0

# Is there something faster?
x = data.groupby(data.index).first()
print(x)

Should give:

>               a    b
> 2018-03-01  NaN  2.0
> 2018-03-02  2.0  1.0
> 2018-03-03  3.0  2.0
> 2018-03-04  5.0  1.0

data.groupby(data.index).first() does the job, but it is ridiculously slow. For a dataframe of shape (5'730'238, 7) it took 40 minutes to remove the duplicates; for another table of shape (1'191'704, 339) it took 5 hours 20 minutes (datetime index, all columns integer/float). Note that the data might contain only a few duplicated values.

In another question, it is suggested to use data[~data.index.duplicated(keep='first')], but this does not handle NaNs in the desired way.
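
For the sample above, that filter drops whole rows rather than filling per column, so the NaNs of the first duplicated row survive (output shown for illustration):

y = data[~data.index.duplicated(keep='first')]
print(y)
>               a    b
> 2018-03-01  NaN  2.0
> 2018-03-02  NaN  1.0
> 2018-03-03  3.0  NaN
> 2018-03-04  5.0  1.0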

It doesn't really matter whether I choose first, last, mean or whatever, as long as it is fast.

Is there a faster way than groupby, or is there a problem with my data that's making it slow?

  • I think this is going to be excessively challenging without the groupby. – cs95 Mar 26 '18 at 06:31
  • Is the date index in ascending order? – HYRY Mar 26 '18 at 07:25
  • How do you have millions of records with a daily index? Or is daily just the sample data and in reality it has another frequency? – ilia timofeev Mar 26 '18 at 09:40
  • The index is a priori not sorted, but I wouldn't mind sorting it. This is just a sample; I think the real data is pulled together from different loggers, each of them at 1 s to 10 s intervals. – Snow bunting Mar 26 '18 at 09:43
  • @coldspeed is there a known limitation of groupby where it takes longer on bad data (e.g. too many columns, an unsorted index, ...)? 5 hours seems really ridiculous :/ – Snow bunting Mar 26 '18 at 09:45

1 Answer


I think the problem is not in the groupby algorithm but in memory consumption. For a million records with seven float columns it took 300 ms, and you can speed that up roughly twofold by using resample with your frequency. But 2 million records with 400 float columns is about 7 GB of memory, and then it becomes a hell of swapping memory to disk. On my machine with 16 GB of physical memory and an SSD, it took 3 minutes to perform the groupby on that sample size (with a sorted index); most of the time was spent swapping memory to disk.
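
A quick way to see whether memory is the limiting factor (rough arithmetic: 2 million rows × 400 float64 columns × 8 bytes is about 6.4 GB, before any groupby overhead) is to check the frame's footprint directly:

# in-memory size of the frame, in GB
print(data.memory_usage(deep=True).sum() / 1024**3)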

I suggest sorting your data, then splitting it and processing it in batches. If you can split it before sorting, do that first.
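
The snippets below operate on a frame called df_sample with a DatetimeIndex named date. Purely for illustration, a synthetic frame of roughly the shapes discussed above could be built like this (all names and sizes here are placeholders):

import numpy as np
import pandas as pd

# ~1 million rows, 7 float columns, duplicated timestamps and a few NaNs
n_rows, n_cols = 1_000_000, 7
seconds = np.sort(np.random.randint(0, n_rows // 2, n_rows))       # forces duplicate timestamps
idx = pd.DatetimeIndex(pd.Timestamp('2018-01-01') + pd.to_timedelta(seconds, unit='s'),
                       name='date')
df_sample = pd.DataFrame(np.random.rand(n_rows, n_cols), index=idx)
df_sample.iloc[::50, 0] = np.nan                                    # sprinkle in some NaNs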


1. Sort

This sorts the data as in your sample (it assumes the index is named date):

df_sample.reset_index(inplace=True)                 # move the index into the 'date' column so it joins the sort
df_sample.sort_values(by=df_sample.columns.tolist(), na_position='first', inplace=True)
df_sample.set_index('date', inplace=True)           # restore the datetime index

2. Work in batches

This is not a safe splitting method, but it is enough for performance testing:

step = 20000
print(df_sample.shape)
%timeit x = pd.concat([df_sample[s:s+step].resample('min').first() for s in range(0, df_sample.shape[0], step)], axis=0)

It took two minutes, instead of the three minutes without splitting.
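
If correctness at the batch boundaries matters (a duplicated timestamp can straddle two batches), one possible variation, not part of the original timing above, is a per-batch groupby followed by a second, much cheaper pass over the concatenated result:

# sketch: first-non-NaN per batch, then merge duplicates that straddle batch boundaries
parts = [df_sample[s:s+step].groupby(level=0).first()
         for s in range(0, df_sample.shape[0], step)]
x = pd.concat(parts).groupby(level=0).first()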


ilia timofeev