
I have a script where I need to aggregate data based on some logic. For example, here is my data frame:

1. df.columns = ['a','b','c','d']
2. The dataframe contains more than 1 million rows (10 million in some cases)
3. Now I need to aggregate the data roughly as follows (four nested loops):
    groupby_a = df.groupby(['a'])
    for i,df_a in groupby_a:
        #some logic...
        groupby_b = df_a.groupby(['b'])
        for j,df_b in groupby_b:    # loop again
            # logic, followed by 2 more nested loops (groupby 'c' and 'd')

My issue is that it is taking too long to process the data. Is there any way I can improve the performance? Any help is really appreciated.
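The `# some logic` steps are not shown, so the right fix depends on what they do. But if the per-group work can be expressed as an aggregation, the four nested Python-level loops can usually be collapsed into a single vectorised `groupby` over all keys, which pandas executes in C. A minimal sketch with made-up data (column names `a`–`d` are from the question; the values and the `sum` aggregation are assumptions):

```python
import pandas as pd

# Tiny stand-in for the real frame (real one has 1M+ rows).
df = pd.DataFrame({
    'a': [1, 1, 2, 2],
    'b': ['x', 'x', 'y', 'y'],
    'c': [10, 10, 20, 20],
    'd': [1.0, 2.0, 3.0, 4.0],
})

# One groupby over all grouping keys replaces the nested loops;
# the per-group "logic" is assumed here to be a sum over 'd'.
agg = df.groupby(['a', 'b', 'c'])['d'].sum().reset_index()
print(agg)
```

If the logic at each nesting level genuinely differs, a single `.agg()` with a dict of per-column functions, or `.transform()` for group-wise derived columns, often still avoids the Python loop.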

Workonphp
  • Have you tried profiling your code to find the bottleneck? Try `cProfile` to find out which functions / methods are expensive. The problem may or may not be in the bits where you specify `# logic`. – jpp Apr 11 '18 at 12:55
  • @jpp Thanks for your suggestion. How can I use the profiler in a module? – Workonphp Apr 11 '18 at 13:16
  • [How can you profile a script?](https://stackoverflow.com/questions/582336/how-can-you-profile-a-script) – jpp Apr 11 '18 at 13:17
  • I saw that link just after your 1st comment. I meant to ask: is there any way we can profile a file (or module) which is part of a package? Or, to be more specific, a Django app. – Workonphp Apr 11 '18 at 13:20
  • The link works with anything you can run from command line, from Jupyter notebook, from any IDE, etc. Feel free to ask a separate question if you get stuck profiling, providing you can pinpoint why it's not working for you. – jpp Apr 11 '18 at 13:21
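Following the comments above, profiling doesn't require running a whole script from the command line: `cProfile` can be wrapped around a single function call from inside any module, including a Django view. A sketch (the `aggregate` function is a hypothetical stand-in for the real aggregation routine):

```python
import cProfile
import io
import pstats

def aggregate():
    # Hypothetical stand-in for the real groupby logic.
    return sum(i * i for i in range(100_000))

# Profile just the one call, then print the five most expensive
# entries sorted by cumulative time.
profiler = cProfile.Profile()
profiler.enable()
aggregate()
profiler.disable()

stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats('cumulative').print_stats(5)
print(stream.getvalue())
```

For a standalone script, `python -m cProfile -s cumulative script.py` gives the same report without code changes.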

0 Answers