
Given a table with many columns:

|-------|-------|-------|-------|
|   A   |   B   |  ..   |   N   |
|-------|-------|-------|-------|
|   1   |   0   |  ..   |   X   |
|   2   |   0   |  ..   |   Y   |
|  ..   |  ..   |  ..   |  ..   |
|-------|-------|-------|-------|

What is the most efficient way to iterate over all column combinations (of all lengths) and perform a GROUP BY operation on each? Since the table and especially the number of combinations can be quite large (2^n), preferably with GPU support.

    import itertools

    # df is the pandas DataFrame shown above
    colnames = df.columns
    for L in range(2, len(colnames)):  # combination lengths 2 .. n-1
        for comb in itertools.combinations(colnames, L):
            dfg = df.groupby(list(comb), sort=False).size().reset_index().rename(columns={0: 'count'})
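
Regarding the GPU part of the question: a minimal sketch using RAPIDS cuDF, whose DataFrame API largely mirrors pandas, so the same loop can push each group-by onto the GPU. This assumes cuDF is installed, a CUDA-capable GPU is available, and that cuDF's groupby/size/reset_index behave like their pandas counterparts; the data below is a placeholder:

    import itertools

    import cudf  # RAPIDS GPU DataFrame library with a pandas-like API

    # Placeholder data standing in for the table above
    gdf = cudf.DataFrame({'A': [1, 2, 1], 'B': [0, 0, 1], 'N': [5, 6, 5]})

    colnames = gdf.columns
    for L in range(2, len(colnames)):
        for comb in itertools.combinations(colnames, L):
            # Each group-by/size aggregation runs on the GPU
            dfg = gdf.groupby(list(comb), sort=False).size().reset_index().rename(columns={0: 'count'})

Note that this only accelerates each individual group-by; as the comments below point out, the sheer number of combinations is the dominant cost.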
  • How large is the table? – NYC Coder May 23 '20 at 16:50
  • Let's say >100 columns. So it's less about the GROUP BY performance and more about the number of combinations to be evaluated – Reacher234 May 23 '20 at 17:17
  • With 100 columns, you're talking about 100 choose 2 + 100 choose 3 + ... + 100 choose 99 calls to groupby. This is on the order of 10^30. If each kernel call runs serially and takes even just one nanosecond, you'd still need to wait 10^21 seconds (forever). Even if you could run one million threads concurrently and each one still finished in one nanosecond, you'd need to wait 10^15 seconds (forever). You'll likely need to reframe this problem in order to find success. – Nick Becker Jun 04 '20 at 04:09
  • Thanks @NickBecker, super valid points. There are some greedy approaches and options for eliminating branches early, but I wanted to keep the problem simple to understand and the focus directed towards GPU acceleration. – Reacher234 Jun 04 '20 at 10:21
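
For reference, the count in Nick Becker's comment can be checked directly; a quick sketch (Python 3.8+ for math.comb):

    import math

    n = 100
    # Sum of C(100, k) for k = 2 .. 99, matching combination lengths 2 .. n-1
    total = sum(math.comb(n, k) for k in range(2, n))
    print(f"{total:.3e}")  # ~1.268e+30 groupby calls

At one nanosecond per call, that is roughly 1.27 x 10^21 seconds, which is where the "forever" above comes from.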

0 Answers