5

What is the proper way to query top N rows by group in python datatable?
For example to get top 2 rows having largest v3 value by id2, id4 group I would do pandas expression in the following way:

df.sort_values('v3', ascending=False).groupby(['id2','id4']).head(2)

in R using data.table:

DT[order(-v3), head(v3, 2L), by=.(id2, id4)]

or in R using dplyr:

DF %>% arrange(desc(v3)) %>% group_by(id2, id4) %>% filter(row_number() <= 2L)

Example data and expected output using pandas:

import datatable as dt
dt = dt.Frame(id2=[1, 2, 1, 2, 1, 2], id4=[1, 1, 1, 1, 1, 1], v3=[1, 3, 2, 3, 3, 3])
df = dt.to_pandas()
df.sort_values('v3', ascending=False).groupby(['id2','id4']).head(2)
#   id2  id4  v3
#1    2    1   3
#3    2    1   3
#4    1    1   3
#2    1    1   2
Pasha
  • 6,298
  • 2
  • 22
  • 34
jangorecki
  • 16,384
  • 4
  • 79
  • 160
  • What is wrong with your example code? Also do you mean `DataFrame`? – Dan Jan 10 '19 at 12:27
  • @Dan nothing is wrong except that it won't scale, thus my question for datatable solution. I mean `datatable` not `DataFrame`, added link in question so it is more clear. – jangorecki Jan 10 '19 at 12:32

1 Answers1

9

Starting from datatable version 0.8.0, this can be achieved by combining grouping, sorting and filtering:

from datatable import *
DT = Frame(id2=[1, 2, 1, 2, 1, 2], 
           id4=[1, 1, 1, 1, 1, 1], 
           v3=[1, 3, 2, 3, 3, 3])

DT[:2, :, by(f.id2, f.id4), sort(-f.v3)]

which produces

     id2  id4  v3
---  ---  ---  --
 0     1    1   3
 1     1    1   2
 2     2    1   3
 3     2    1   3

[4 rows x 3 columns]

Explanation:

  • by(f.id2, f.id4) groups the data by columns "id2" and "id4";
  • the sort(-f.v3) command tells datatable to sort the records by column "v3" in descending order. In the presence of by() this operator will be applied within each group;
  • the first :2 selects the top 2 rows, again within each group;
  • the second : selects all columns. If needed, this could have been a list of columns or expressions, allowing you to perform some operation(s) on the first 2 rows of each group.
Pasha
  • 6,298
  • 2
  • 22
  • 34