3

For large datasets, Koalas' head(n) takes a really long time. As I understand it, it tries to bring all the data back to the driver node and then presents the absolute top n rows.

Is there a quick way to look at n rows in Koalas such that only one or a few partitions are involved? I do not necessarily need the absolute first n rows; they can be spread across different executor nodes or all reside in the same partition.

Mohit Jain

1 Answer

2

Adding this statement after importing Koalas seemed to help for me:

koalas.set_option('compute.default_index_type', 'distributed-sequence')
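As far as I understand the Koalas documentation, the default index type ('sequence') is built with a window function that has no partition specification, so all the data can end up on a single node; 'distributed-sequence' computes the index across the executors instead. Below is a rough sketch of how the option could be applied, plus an alternative that is not part of the setting above: going through Spark's limit(), which returns some n rows with no guarantee about which ones. The helper names (use_distributed_index, preview_rows, demo) are made up for illustration, and demo() assumes a working Spark and Koalas installation.

```python
def use_distributed_index(ks):
    """Switch Koalas' default index to one computed in a distributed way.

    `ks` is the imported Koalas module (import databricks.koalas as ks).
    """
    ks.set_option("compute.default_index_type", "distributed-sequence")


def preview_rows(kdf, n=5):
    """Return up to n arbitrary rows of a Koalas DataFrame as a pandas DataFrame.

    Uses Spark's limit(), which does not guarantee which rows come back.
    """
    return kdf.to_spark().limit(n).toPandas()


def demo():
    # Hypothetical usage; needs Spark and Koalas, so it is not run automatically.
    import databricks.koalas as ks

    use_distributed_index(ks)
    kdf = ks.range(1_000_000)    # stand-in for a large dataset
    print(kdf.head(5))           # first rows, with the distributed index
    print(preview_rows(kdf, 5))  # any 5 rows, via Spark's limit()
```

If the absolute first rows do not matter, preview_rows() is probably the closer match to the question, since Spark can usually satisfy a limit without reading every partition; whether it is actually faster depends on the data source and the query plan.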
Dean