
At the moment, df.sort_values in Dask only accepts sorting by a single column.

I have a large file with this structure:

input data

I don't know how to sort the data first by the integer column and then by the date, like:

  • 2000-01-01 ; 43000
  • 2000-01-02 ; 43000
  • 2000-01-01 ; 25000
  • 2000-01-02 ; 25000

I think creating a combined column and sorting on it would be the best option. The problem is that I don't know how to create a column that accomplishes this. Maybe there is another way to do this in Dask without creating a combined column...
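One way to build such a combined column is a fixed-width string key, sketched here in plain pandas with made-up column names (date and id); the same assign-then-sort pattern carries over to a Dask DataFrame:

```python
import pandas as pd

# toy stand-in for the large file (column names are assumptions)
df = pd.DataFrame({
    'date': pd.to_datetime(['2000-01-02', '2000-01-01',
                            '2000-01-02', '2000-01-01']),
    'id': [25000, 43000, 43000, 25000],
})

# zero-padded integer first, then a fixed-width date string, so plain
# lexicographic order on the key means (id ascending, date ascending)
df['key'] = (df['id'].astype(str).str.zfill(10)
             + df['date'].dt.strftime('%Y%m%d'))
out = df.sort_values('key')
```

This sorts both parts ascending; if the integer should come out descending (as in the sample above), subtract the id from a large constant before zero-padding it.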

Thanks!

lsgrep

1 Answer


Assuming d['col1'] is datetime-type and d['col2'] is int-type:

import struct
import numpy as np

# express the datetime column as integer days since the earliest date
d['col1_int'] = ((d['col1'] - d['col1'].min())
                 / np.timedelta64(1, 'D')).astype(int)

# pack both keys as big-endian 8-byte ints so the raw bytes sort in
# numeric order (this assumes both values are non-negative)
d['sort_col'] = d.apply(lambda r: struct.pack('>qq', r.col1_int, r.col2),
                        axis=1, meta=('sort_col', 'object'))

d = d.set_index('sort_col')                     # shuffle into globally ordered partitions
d = d.map_partitions(lambda x: x.sort_index())  # sort within each partition
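As a quick sanity check of the packed-key trick, the same logic can be run on a plain pandas frame with toy data (column names follow the snippet above):

```python
import struct
import numpy as np
import pandas as pd

# small stand-in for the large Dask frame
df = pd.DataFrame({
    'col1': pd.to_datetime(['2000-01-02', '2000-01-01',
                            '2000-01-02', '2000-01-01']),
    'col2': [43000, 25000, 25000, 43000],
})

# days since the earliest date, as int
df['col1_int'] = ((df['col1'] - df['col1'].min())
                  / np.timedelta64(1, 'D')).astype(int)

# big-endian packing keeps byte order equal to numeric order
# (only holds for non-negative values)
df['sort_col'] = df.apply(lambda r: struct.pack('>qq', r.col1_int, r.col2),
                          axis=1)

out = df.set_index('sort_col').sort_index()
# rows come out ordered by date first, then by the integer column
```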

Reworked from this answer

Joshua Voskamp