
The following code works well. I'm just checking: am I using and timing pandas correctly, and is there a faster way? Thanks.

$ python3
Python 3.4.0 (default, Apr 11 2014, 13:05:11) 
[GCC 4.8.2] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> import numpy as np
>>> import timeit
>>> pd.__version__
'0.14.1'

def randChar(f, numGrp, N):
    # Build numGrp distinct strings from format f, then sample N of them
    # by integer index (a workaround for the np.random.choice memory error; see below).
    things = [f % x for x in range(numGrp)]
    return [things[x] for x in np.random.choice(numGrp, N)]

def randFloat(numGrp, N):
    # Build numGrp distinct rounded floats, then sample N of them by integer index.
    things = [round(100 * np.random.random(), 4) for x in range(numGrp)]
    return [things[x] for x in np.random.choice(numGrp, N)]

N=int(1e8)
K=100
DF = pd.DataFrame({
  'id1' : randChar("id%03d", K, N),       # large groups (char)
  'id2' : randChar("id%03d", K, N),       # large groups (char)
  'id3' : randChar("id%010d", N//K, N),   # small groups (char)
  'id4' : np.random.choice(K, N),         # large groups (int)
  'id5' : np.random.choice(K, N),         # large groups (int)
  'id6' : np.random.choice(N//K, N),      # small groups (int)            
  'v1' :  np.random.choice(5, N),         # int in range [0,4]
  'v2' :  np.random.choice(5, N),         # int in range [0,4]
  'v3' :  randFloat(100,N)                # numeric e.g. 23.5749
})

Now time 5 different groupings, repeating each one twice to confirm the timings. (I realise timeit(2) would run the statement twice, but it reports only the total; I'm interested in the times of the first and second runs separately.) Python uses about 10 GB of RAM according to htop during these tests.
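(As an aside, timeit.Timer.repeat returns one timing per run, so each pair below could also be captured in a single call; a minimal sketch:)

t = timeit.Timer("DF.groupby(['id1']).agg({'v1':'sum'})", "from __main__ import DF")
first, second = t.repeat(repeat=2, number=1)   # one float per run, reported separately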

>>> timeit.Timer("DF.groupby(['id1']).agg({'v1':'sum'})"                            ,"from __main__ import DF").timeit(1)
5.604133386000285
>>> timeit.Timer("DF.groupby(['id1']).agg({'v1':'sum'})"                            ,"from __main__ import DF").timeit(1)
5.505057081000359

>>> timeit.Timer("DF.groupby(['id1','id2']).agg({'v1':'sum'})"                      ,"from __main__ import DF").timeit(1)
14.232032927000091
>>> timeit.Timer("DF.groupby(['id1','id2']).agg({'v1':'sum'})"                      ,"from __main__ import DF").timeit(1)
14.242601240999647

>>> timeit.Timer("DF.groupby(['id3']).agg({'v1':'sum', 'v3':'mean'})"               ,"from __main__ import DF").timeit(1)
22.87025260900009
>>> timeit.Timer("DF.groupby(['id3']).agg({'v1':'sum', 'v3':'mean'})"               ,"from __main__ import DF").timeit(1)
22.393589012999655

>>> timeit.Timer("DF.groupby(['id4']).agg({'v1':'mean', 'v2':'mean', 'v3':'mean'})" ,"from __main__ import DF").timeit(1)
2.9725865330001398
>>> timeit.Timer("DF.groupby(['id4']).agg({'v1':'mean', 'v2':'mean', 'v3':'mean'})" ,"from __main__ import DF").timeit(1)
2.9683854739996605

>>> timeit.Timer("DF.groupby(['id6']).agg({'v1':'sum', 'v2':'sum', 'v3':'sum'})"    ,"from __main__ import DF").timeit(1)
12.776488024999708
>>> timeit.Timer("DF.groupby(['id6']).agg({'v1':'sum', 'v2':'sum', 'v3':'sum'})"    ,"from __main__ import DF").timeit(1)
13.558292575999076

Here is the system info:

$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                32
On-line CPU(s) list:   0-31
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 62
Stepping:              4
CPU MHz:               2500.048
BogoMIPS:              5066.38
Hypervisor vendor:     Xen
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              25600K
NUMA node0 CPU(s):     0-7,16-23
NUMA node1 CPU(s):     8-15,24-31

$ free -h
             total       used       free     shared    buffers     cached
Mem:          240G        74G       166G       372K        33M       550M
-/+ buffers/cache:        73G       166G
Swap:           0B         0B         0B

I don't believe it's relevant, but just in case: the randChar function above is a workaround for a memory error in mtrand.RandomState.choice:

How to solve memory error in mtrand.RandomState.choice?
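(For context, a minimal sketch of the direct call being worked around; per the linked question, sampling the strings themselves hit the memory error for large N on this NumPy version. f and numGrp here refer to randChar's arguments:)

things = [f % x for x in range(numGrp)]   # the same strings randChar builds
np.random.choice(things, N)               # direct string sampling; randChar instead samples
                                          # integer indices and maps them back to the strings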

  • `df.groupby` is pretty well optimized. What alternatives are you considering? The only thing I can think of is to set the `id` columns as the index and then use `df.groupby(level=id_whatever)` (sketched below, after this comment thread). – Paul H Sep 02 '14 at 19:55
  • @PaulH Thanks I'll try `id` columns as an index. I'm comparing to R's `data.table` (which I maintain). – Matt Dowle Sep 02 '14 at 20:02
  • Oh cool. The other thing I'll mention is that doing this in an IPython Notebook and using the `timeit` magic will probably maintain some of your sanity. http://nbviewer.ipython.org/github/ipython/ipython/blob/1.x/examples/notebooks/Cell%20Magics.ipynb#-Some-simple-cell-magics – Paul H Sep 02 '14 at 20:06
  • @PaulH Nice. Can I point IPython to a server over SSH? – Matt Dowle Sep 02 '14 at 20:41
  • I believe so, but you might find it too complicated for this exercise: http://ipython.org/ipython-doc/1/interactive/public_server.html. Last comment: also not trivial to set up, but there is an %%R magic for notebooks, so you can run Python code and R code from the same interface. There's also a native R kernel in the works, if it hasn't been released already. – Paul H Sep 02 '14 at 20:46
  • One thing I would say is that it may be worth seeding your random numbers, since (I'm sure you're aware!) they will affect the speed of the groupbys. In the pandas vbench code it explicitly tests performance for both dense and sparse groups (no doubt that is *not* the technical term - sorry) as they have different behaviour. – Andy Hayden Sep 03 '14 at 05:35
  • @AndyHayden Good point. The construction in the question here is a tough but realistic test in that the groups are thoroughly shuffled in the data. If there is some clustering of the groups then it should be faster due to cache efficiency. Is that what you mean? 'seeding' to me means setting the `random.seed` to get the same series of random numbers each time. – Matt Dowle Sep 03 '14 at 16:22
  • @MattDowle oops yes, two parts: 1. setting np.random.seed to make times fair/constant ([not done in pandas yet](https://github.com/pydata/pandas/issues/8144)), 2. testing different cluster sizes (large vs small groups). – Andy Hayden Sep 03 '14 at 21:25
  • @AndyHayden Thanks to your comments I've added some items to the [FAQ](https://github.com/Rdatatable/data.table/wiki/Benchmarks-%3A-Grouping). On 2), that's already done in this test, for different column types as well. Feel free to edit directly or suggest more tests. – Matt Dowle Sep 04 '14 at 00:54
  • @MattDowle thanks for putting this together, excited to see the results! – Andy Hayden Sep 04 '14 at 01:04
  • @AndyHayden Pandas results now added: [benchmarks](https://github.com/Rdatatable/data.table/wiki/Benchmarks-:-Grouping) – Matt Dowle Sep 12 '14 at 11:48
  • This is really, *really* nice. – WGS Oct 20 '14 at 20:32
  • I would also suggest using IPython, and using its %timeit magic. There are other magic commands as well, and I have found they do a better job of timing functions or lines of code. http://ipython.org/ipython-doc/dev/interactive/magics.html#magic-timeit – Wesley Bowman Nov 07 '14 at 13:12
  • @NightHallow Thanks. I'll use IPython's %timeit next time. – Matt Dowle Nov 07 '14 at 17:22
  • I'm voting to close this question as off-topic because questions about optimizing working code belong on http://codereview.stackexchange.com/ – bummi Jan 31 '15 at 16:59
  • Or: why doesn't the OP just answer the question? It has now been upvoted 39 times and 2000 people have viewed it. The answer can simply restate the question, e.g. "Yes, this seems to be the fastest way to group in pandas." – AN6U5 Feb 10 '15 at 21:51
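Following up on the comments: here is a minimal, untimed sketch of the index-level grouping Paul H suggests, together with the seeding Andy Hayden suggests. The seed value and the name DFi are illustrative, not from the question.

# Seeding (Andy Hayden's point): call this *before* the data generation above,
# so repeated benchmark runs group identical data.
np.random.seed(1234)   # arbitrary seed value

# Index-level grouping (Paul H's point): set the grouping column as the index,
# then group on that level rather than on the column.
DFi = DF.set_index('id1')
DFi.groupby(level='id1').agg({'v1': 'sum'})   # same result as DF.groupby(['id1'])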

1 Answer


If you'd like to install the IPython shell, you can easily time your code using %timeit. After installing it, instead of typing python to launch the interpreter, you would type ipython.

Then you can type your code exactly as you would type it in the normal interpreter (as you did above).

Then you can type, for example:

%timeit DF.groupby(['id1']).agg({'v1':'sum'})

This accomplishes exactly the same thing as what you've done, but if you're using Python a lot, I find it saves significant typing time :).

IPython has a lot of other nice features, like %paste (which I used to paste in your code and test this), %run to run a script you've saved in a file, tab completion, etc. http://ipython.org/
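If you want to mirror the single-run timings from the question, %timeit also accepts -n (loops per repeat) and -r (number of repeats); note that it reports the best repeat rather than each run separately:

%timeit -n 1 -r 2 DF.groupby(['id1']).agg({'v1':'sum'})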

tegan