
I am working with a large dataset of around 50 million rows. I want to find the unique combinations of values across multiple columns. I use the script below:

dataAll[['Frequency', 'Period', 'Date']].drop_duplicates()

But this takes a long time, more than 40 minutes.

I found an alternative:

pd.unique(dataAll[['Frequency', 'Period', 'Date']].values.ravel('K'))


But the script above returns an array, whereas I need a DataFrame like the one the first script returns.


Learnings
    You could maintain a Python [set](https://docs.python.org/2/library/sets.html) of tuples of the labels `Frequency`, `Period`, and `Date`, iterating over the rows and checking/updating the set's membership. This should be roughly linear in the number of rows, except perhaps for the tuple creation. I'd be surprised, however, if pandas did not take a similar approach to their `drop_duplicates`. – Nelewout Jun 09 '18 at 08:33
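
The set-of-tuples idea from the comment above can be sketched as follows. This is a minimal illustration on a tiny made-up frame (the sample values are assumptions, not the asker's data), and a Python-level loop like this is not guaranteed to beat `drop_duplicates` on 50 million rows.

```python
import pandas as pd

# Tiny stand-in for dataAll; the values are made up for illustration.
dataAll = pd.DataFrame({
    "Frequency": [1, 1, 3, 3, 1],
    "Period": [10, 10, 20, 20, 10],
    "Date": pd.to_datetime(
        ["2018-01-01", "2018-01-01", "2018-01-02", "2018-01-02", "2018-01-03"]
    ),
})

cols = ["Frequency", "Period", "Date"]

seen = set()       # tuples already encountered
unique_rows = []   # first occurrence of each tuple, in original order

# itertuples(index=False, name=None) yields plain tuples, one per row.
for row in dataAll[cols].itertuples(index=False, name=None):
    if row not in seen:
        seen.add(row)
        unique_rows.append(row)

# Rebuild a DataFrame from the unique tuples.
unique_df = pd.DataFrame(unique_rows, columns=cols)
print(unique_df)
```

Because whole rows are stored as tuples, the association between `Frequency`, `Period`, and `Date` is preserved, so the result can be turned back into a DataFrame.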

1 Answer


In general, the output of your new code cannot be converted back to a DataFrame, because:

pd.unique(dataAll[['Frequency', 'Period', 'Date']].values.ravel('K'))

creates one big 1D NumPy array, so after the duplicates are removed it is impossible to recreate the rows.

E.g. if there are 2 unique values, 3 and 1, it is impossible to tell which datetimes belong to 3 and which to 1.

But if there is only one unique value for Frequency, and the Date for each Period can be determined as in the sample, a solution is possible.
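
To illustrate why the flattened array cannot be turned back into rows, here is a small sketch on a made-up three-row frame (the values are assumptions chosen for illustration). `ravel` mixes all columns into one 1D array, so `pd.unique` only reports which values occur anywhere, not which values occurred together.

```python
import pandas as pd

# Made-up sample: Frequency takes the values 3 and 1 on two dates.
df = pd.DataFrame({
    "Frequency": [3, 1, 3],
    "Date": pd.to_datetime(["2018-06-01", "2018-06-01", "2018-06-02"]),
})

# Flatten both columns into a single 1D array, then de-duplicate values.
flat = pd.unique(df[["Frequency", "Date"]].values.ravel("K"))

# De-duplicate whole rows instead; the pairing of columns is kept.
rows = df[["Frequency", "Date"]].drop_duplicates()

print(flat.ndim, len(flat))  # one dimension, distinct values only
print(rows.shape)            # unique (Frequency, Date) pairs
```

The flat result contains 3, 1 and the two dates, but nothing records that 3 occurred on both dates while 1 occurred on only one of them.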

EDIT:

One possible alternative is to use dask.dataframe.DataFrame.drop_duplicates.

jezrael
  • Thanks... Are there any alternatives? I need to improve the performance of the script. – Learnings Jun 09 '18 at 08:25
  • Frequency contains 0,1,2,3,4,5,6 for different dates. – Learnings Jun 09 '18 at 08:26
  • Could you please give a solution for only 2 columns, 'Period' and 'Date'? – Learnings Jun 09 '18 at 08:28
  • @faithon.gvr.py - 50 million rows is a really huge DataFrame. I suggest working on a server with a lot of RAM; another library such as `dask` should also be helpful. Some ideas from [this](https://stackoverflow.com/q/14262433/2901002) should help too. – jezrael Jun 09 '18 at 08:49