
I have a DataFrame with MultiIndex columns which is basically a binary matrix:

day        day01                      day02                  
session session1 session2 session3 session1 session2 session3
0              1        0        0        0        0        0
1              0        0        1        1        1        0
2              1        1        1        0        0        1
3              1        0        0        1        0        0
4              1        0        1        0        0        0

From this DataFrame, I need to calculate daily sums for each row:

     day01  day02
0        1      0
1        1      2
2        3      1
3        1      1
4        2      0

And get the number of 0s, 1s... (value counts) in this sum:

0    2
1    5
2    2
3    1

I need to do this for sessions, too. Session sums for each row:

         session1  session2  session3
0               1         0         0
1               1         1         1
2               1         1         2
3               2         0         0
4               1         0         1

And get the value counts:

0    5
1    8
2    2

As a baseline, these are the results of df.groupby(level='day', axis=1).sum().stack().value_counts() and df.groupby(level='session', axis=1).sum().stack().value_counts(), respectively. The DataFrame changes in each iteration of a simulated annealing algorithm and these counts have to be recalculated every time. When I profiled the code, I saw that a significant amount of time was spent on the groupby operations.
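Spelled out step by step, the baseline is just the quoted expressions, split so the intermediate sums correspond to the tables above (a small sketch on a randomly filled frame with the same column structure, not the exact values shown):

import numpy as np
import pandas as pd

cols = pd.MultiIndex.from_product(
    [['day01', 'day02'], ['session1', 'session2', 'session3']],
    names=['day', 'session'])
df = pd.DataFrame(np.random.RandomState(0).binomial(1, 0.5, (5, 6)), columns=cols)

daily_sums = df.groupby(level='day', axis=1).sum()          # table with one column per day
daily_counts = daily_sums.stack().value_counts()            # how many 0s, 1s, ... in those sums

session_sums = df.groupby(level='session', axis=1).sum()    # table with one column per session
session_counts = session_sums.stack().value_counts()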

I tried saving the groupby objects and taking the sums on those objects in each iteration, but the improvement was only about 10%. Here's the code to create a larger DataFrame (similar to the one I have):

import numpy as np
import pandas as pd
prng = np.random.RandomState(0)
days = ['day{0:02d}'.format(i) for i in range(1, 11)]
sessions = ['session{}'.format(i) for i in range(1, 5)]
idx = pd.MultiIndex.from_product((days, sessions), names=['day', 'session'])
df = pd.DataFrame(prng.binomial(1, 0.25, (1250, 40)), columns=idx)

On my computer, the following two methods take 3.8 s and 3.38 s, respectively.

def try1(df, num_repeats=1000):
    for i in range(num_repeats):
        session_counts = (df.groupby(level='session', axis=1, sort=False)
                            .sum()
                            .stack()
                            .value_counts(sort=False))
        daily_counts = (df.groupby(level='day', axis=1, sort=False)
                          .sum()
                          .stack()
                          .value_counts(sort=False))
    return session_counts, daily_counts

def try2(df, num_repeats=1000):
    session_groups = df.groupby(level='session', axis=1, sort=False)
    day_groups = df.groupby(level='day', axis=1, sort=False)
    for i in range(num_repeats):
        df.iat[0, 0] = (i + 1) % 2
        session_counts = session_groups.sum().stack().value_counts(sort=False)
        daily_counts = day_groups.sum().stack().value_counts(sort=False)
    return session_counts, daily_counts

%time try1(df)
Wall time: 3.8 s

%time try2(df)
Wall time: 3.38 s

Note: The loops in the functions are for timing purposes only. For the second function, I needed to modify the DataFrame inside the loop in order to get correct timings.

I am currently working on another method to directly reflect the changes in the DataFrame in the counts, without recalculating the groups, but I haven't succeeded yet. Tracking the affected rows and updating the saved DataFrames turned out to be slower.
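The idea I'm experimenting with looks roughly like this (a rough sketch only; the names and bookkeeping are illustrative, not my actual code): keep the day/session sums and the bin counts around, and when a single cell flips, move one count between neighbouring bins instead of regrouping everything.

import numpy as np

# Rough sketch of the bookkeeping: keep the 3D array plus the running sums and
# bin counts, and adjust them in O(1) when one cell flips.
n_rows, n_days, n_sessions = 1250, 10, 4
prng = np.random.RandomState(0)
arr = prng.binomial(1, 0.25, (n_rows, n_days, n_sessions))

daily_sums = arr.sum(axis=2)      # shape (n_rows, n_days), values in 0..n_sessions
session_sums = arr.sum(axis=1)    # shape (n_rows, n_sessions), values in 0..n_days
daily_bins = np.bincount(daily_sums.ravel(), minlength=n_sessions + 1)
session_bins = np.bincount(session_sums.ravel(), minlength=n_days + 1)

def flip_cell(row, day, session):
    """Flip one cell and move one count between neighbouring bins."""
    delta = 1 - 2 * arr[row, day, session]   # +1 if the cell was 0, -1 if it was 1
    arr[row, day, session] += delta

    daily_bins[daily_sums[row, day]] -= 1
    daily_sums[row, day] += delta
    daily_bins[daily_sums[row, day]] += 1

    session_bins[session_sums[row, session]] -= 1
    session_sums[row, session] += delta
    session_bins[session_sums[row, session]] += 1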

Is there a way to improve the performance of these groupby operations?

ayhan
  • Does the order of elems in the two outputs matter? Additionally, do the indexes of the two outputs matter? – Divakar Aug 20 '16 at 12:39
  • No, as long as I know how many 0's, 1's etc. are in there, the order (or which data structure holds that information) is not important. I should know which one corresponds to 0's, which one corresponds to 1's though. – ayhan Aug 20 '16 at 12:42

1 Answer


Assuming a regular data format (the same set of sessions under every day, for every row), here's a NumPy-based approach using np.unique, with the output indexes in sorted order -

# Extract the underlying array, reshaped to (rows, days, sessions)
a,b = df.columns.levels
arr = df.values.reshape(-1,len(a),len(b))

# Get session counts (sum over the day axis)
session_sums = arr.sum(1)
unq,count = np.unique(session_sums,return_counts=True)
session_counts_out = pd.Series(count,index=unq)

# Get daily counts (sum over the session axis)
daily_sums = arr.sum(2)
unq,count = np.unique(daily_sums,return_counts=True)
daily_counts_out = pd.Series(count,index=unq)
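
To sanity-check against the groupby baseline from the question, something along these lines should do (np.unique already returns the sums in ascending order, so only sorting the baseline's index is needed) -

ref = (df.groupby(level='session', axis=1).sum()
         .stack()
         .value_counts()
         .sort_index())
# index dtypes can differ between NumPy and pandas here, so compare contents
assert (session_counts_out.values == ref.values).all()
assert (session_counts_out.index == ref.index).all()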

If you are only interested in the counts without the indexes, here's an alternative with np.bincount, which essentially does just the counting that the return_counts part of np.unique does -

# Get session counts
session_sums = arr.sum(1)
count = np.bincount(session_sums.ravel())
session_counts_out = count[count>0]

# Get daily count
daily_sums = arr.sum(2)
count = np.bincount(daily_sums.ravel())
daily_counts_out = count[count>0]
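
If you'd rather index the counts directly by the sum value (see the comments), you could keep the full bin arrays and pad them with minlength, something like this -

# Keep the full bins so counts can be looked up by value, e.g. daily_bins[2];
# minlength pads with zeros for sums that never occur.
session_bins = np.bincount(arr.sum(1).ravel(), minlength=len(a) + 1)  # sums over days: 0..len(a)
daily_bins = np.bincount(arr.sum(2).ravel(), minlength=len(b) + 1)    # sums over sessions: 0..len(b)
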
Divakar
  • Thanks. Looks very promising. Let me try it. – ayhan Aug 20 '16 at 13:17
  • bincount is about 7 times faster than groupby (I removed the part for `count[count>0]` so I can access by index). Let me keep it open for a couple of days to see if there are any other alternatives. Thank you again. – ayhan Aug 20 '16 at 14:17
  • using np.bincount is quite an anti pattern - it only is valid with a very limited input set and doesn't do any conversions; you are forcing the user to do so much work – Jeff Aug 20 '16 at 15:46
  • @Jeff If by very limited input set, you mean numerals, yes it's meant for that only. But the upside is the performance. I guess it's a trade-off that a user can decide. `np.unique` is always there as a good backup as mentioned in the first approach. Comeon downvote for that!? – Divakar Aug 20 '16 at 15:48
  • no it's not - pd.unique is MUCH faster and just do a straight value_counts – Jeff Aug 20 '16 at 15:49
  • @Jeff You are saying `pd.unique` alongwith its `value_counts ` is faster than both `np.unique` and `np.bincount`? – Divakar Aug 20 '16 at 15:52
  • try it - it doesn't sort and uses hashtables – Jeff Aug 20 '16 at 15:52
  • @Jeff Yeah I am not sure how to use `pd.unique` for this case. Why not post that as an answer and we can time it? – Divakar Aug 20 '16 at 15:58
  • @Jeff To be fair, me adding the numpy tag led to these suggestions. The DataFrame consists of 0 and 1s. Those are the only valid entries, and there are no missing values. Their summation will be a non-negative integer, so for this specific case bincount seems appropriate though it's not a general substitute for value_counts. I am happily using pandas before and after that algorithm runs but inside that loop I need a faster alternative be it a pandas or a numpy method. I might need to start learning about cython too. – ayhan Aug 21 '16 at 13:45
  • @Jeff We are waiting for the `MUCH faster` pd.unique solution to drop that's gonna blow away both `np.unique` and `np.bincount` ;) – Divakar Aug 22 '16 at 12:50
  • well I am away- I think u can simply time pd.unique and np.unique – Jeff Aug 22 '16 at 13:06
  • @Jeff `pd.unique(df)` doesn't work on the sample dataframe provided in the question. That `pd.unique` also gonna be `MUCH faster` than `np.bincount`, right? We can wait till you get back. – Divakar Aug 22 '16 at 13:20
  • Hey @Jeff - Do you have a particular issue with my NumPy based solutions on `pandas` tagged questions based on the downvotes I am getting on those from you? If there's any, I would love to hear so that I can improve. Otherwise, I am not getting the message if any, with those downvotes. – Divakar Sep 24 '16 at 09:52
  • @Jeff Again, I am not sure on the latest downvote from you on my multi-index pandas tagged answer. If there's an issue or problem or just anything, I would like to know that and improve upon. Without any comment, I am not getting any idea on any of those. Would really appreciate some kind of feedback. – Divakar Jan 06 '17 at 08:28