
At the moment, df.sort_values in Dask only accepts sorting by a single column.

I have a large file with this structure:

input data

I don't know how to sort the data first by the integer column and then by the date, like:

  • 2000-01-01 ; 43000
  • 2000-01-02 ; 43000
  • 2000-01-01 ; 25000
  • 2000-01-02 ; 25000

I think creating a combined column and sorting on it would be the best option. The problem is that I don't know how to create a column that accomplishes this. Maybe there is another way to do this in Dask without creating a combined column...
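One way to build such a combined column is a fixed-width string key, sketched here in plain pandas with made-up column names (date and id); the same assign-then-sort pattern carries over to a Dask DataFrame:

```python
import pandas as pd

# toy stand-in for the large file (column names are assumptions)
df = pd.DataFrame({
    'date': pd.to_datetime(['2000-01-02', '2000-01-01',
                            '2000-01-02', '2000-01-01']),
    'id': [25000, 43000, 43000, 25000],
})

# zero-padded integer first, then a fixed-width date string, so plain
# lexicographic order on the key means (id ascending, date ascending)
df['key'] = (df['id'].astype(str).str.zfill(10)
             + df['date'].dt.strftime('%Y%m%d'))
out = df.sort_values('key')
```

This sorts both parts ascending; if the integer should come out descending (as in the sample above), subtract the id from a large constant before zero-padding it.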

Thanks!

lsgrep

1 Answer


Assuming d['col1'] is datetime-type and d['col2'] is int-type:

import struct
import numpy as np

# express the datetime column as integer days since the earliest date
d['col1_int'] = ((d['col1'] - d['col1'].min())
                 / np.timedelta64(1, 'D')).astype(int)

# pack both keys as big-endian 8-byte ints so the raw bytes sort in
# numeric order (this assumes both values are non-negative)
d['sort_col'] = d.apply(lambda r: struct.pack('>qq', r.col1_int, r.col2),
                        axis=1, meta=('sort_col', 'object'))

d = d.set_index('sort_col')                     # shuffle into globally ordered partitions
d = d.map_partitions(lambda x: x.sort_index())  # sort within each partition
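As a quick sanity check of the packed-key trick, the same logic can be run on a plain pandas frame with toy data (column names follow the snippet above):

```python
import struct
import numpy as np
import pandas as pd

# small stand-in for the large Dask frame
df = pd.DataFrame({
    'col1': pd.to_datetime(['2000-01-02', '2000-01-01',
                            '2000-01-02', '2000-01-01']),
    'col2': [43000, 25000, 25000, 43000],
})

# days since the earliest date, as int
df['col1_int'] = ((df['col1'] - df['col1'].min())
                  / np.timedelta64(1, 'D')).astype(int)

# big-endian packing keeps byte order equal to numeric order
# (only holds for non-negative values)
df['sort_col'] = df.apply(lambda r: struct.pack('>qq', r.col1_int, r.col2),
                          axis=1)

out = df.set_index('sort_col').sort_index()
# rows come out ordered by date first, then by the integer column
```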

Reworked from this answer

Joshua Voskamp