Order of magnitude execution time difference of for-loop function between PC and Mac

Question

I have a function containing a for-loop with numerous operations that I run on a dataframe of shape (68k, 10) in a Jupyter notebook on Windows. My machine has an 11th-gen Intel i7-1165G7 2.8 GHz processor (1.5 years old) with 16 GB of RAM.

The execution takes 2928 seconds, as evidenced by the following cProfile results. I am including the top tasks by execution time:

On my colleague's 4-year old Mac, this same function takes 191 seconds:

The built-in method nt.stat, which takes 812 seconds on my machine, doesn't even appear in my colleague's profiling results. I have also tried this on a third colleague's older Windows machine, and it takes longer there than on mine. Why is there such a big difference in execution time?

If helpful, I can try to include the entire cProfile results.

Update:

I created a reproducible example and ran it on both machines.

Here is the code:

import numpy as np
import pandas as pd

import cProfile

df_shape = [100000, 8]
df = pd.DataFrame(np.random.uniform(0, 0.3, df_shape), index = pd.date_range('2022-01-01', periods=df_shape[0], freq='H'))
df.columns = ['loss_' + str(i) for i in range(df_shape[1])]

var_incoming = np.random.normal(550, 80, df.shape[0])

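def test_func(df_, incoming):
    # `incoming` holds one starting value per row (renamed from `vars`,
    # which shadows the Python built-in of the same name).
    a_dct = {}
    wf_dct = {}

    res_lst = []
    wf_res_lst = []

    for i in range(df_.shape[0]):
        y = incoming[i]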
        for j, z in df_.iloc[i, :].items():
            a = z * y

            a_dct[j] = a

            y = y * (1 - z)
            wf_dct[j] = y

        temp1 = pd.DataFrame(dict(a_dct), index=[df_.index[i]])
        temp2 = pd.DataFrame(dict(wf_dct),
                                    index=[df_.index[i]])

        res_lst.append(temp1)
        wf_res_lst.append(temp2)

    res = pd.concat(res_lst, axis=0)
    wf = pd.concat(wf_res_lst, axis=0)

    return res, wf
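
The example above imports cProfile but does not show how the profile was collected. The following is a minimal sketch of one way to produce a comparable report on each machine, sorted by total time and printed with the platform and Python version. The helper `profile_call` and the stand-in function `work` are hypothetical names introduced for illustration; they are not part of the original post.

```python
import cProfile
import io
import platform
import pstats
import sys


def work(n):
    # Hypothetical stand-in for the function being profiled.
    total = 0
    for i in range(n):
        total += i * i
    return total


def profile_call(func, *args, top=10):
    """Run func under cProfile; return (result, report sorted by total time)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    buf = io.StringIO()
    stats = pstats.Stats(profiler, stream=buf)
    stats.sort_stats("tottime").print_stats(top)
    return result, buf.getvalue()


result, report = profile_call(work, 100_000)
# Record the environment next to the numbers so runs can be compared.
print(platform.platform(), sys.version.split()[0])
print(report)
```

With the reproducible example, the same helper could presumably be called as `profile_call(test_func, df, var_incoming)` on both machines.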

The commands performed seem to be largely the same across the machines. The difference in execution time is now smaller, and unfortunately, I no longer see the nt.stat call that was previously taking very long.

Colleague's machine:

I suspect the Mac has a GPU that Python can take advantage of for parallel operations. — Barmar, Nov 15 '22 at 17:42
There are plenty of reasons that could be the root of this, but without a minimal reproducible example it is very hard to know. If `nt.stat` is a function that accesses devices such as a hard drive or the network, then such a difference between machines is to be expected. Unfortunately, we have no information about the parameters of `nt.stat`, the context, or the disk/network hardware. — Jérôme Richard, Nov 15 '22 at 17:58
@Barmar Pandas tends not to use the GPU, whatever the target platform. In fact, it should fall back to either NumPy code that does not use the GPU (except maybe with an unusual BLAS configuration) or pure-Python code (especially for strings, which are apparently computed here). It would not explain a 100x ratio anyway. — Jérôme Richard, Nov 15 '22 at 18:01
@Jérôme Richard I do not call nt.stat in my code at all, so I'm not sure which process is actually responsible for it. I will try to create a minimal reproducible example. — matsuo_basho, Nov 15 '22 at 19:29
You can get more information by visualizing the call-graph. See [How can you get the call tree with Python profilers?](https://stackoverflow.com/questions/4544784/how-can-you-get-the-call-tree-with-python-profilers) including https://vmprof.readthedocs.io/en/latest/. — Jérôme Richard, Nov 15 '22 at 19:52
I've graphed the output as well, both with gprof2dot and tuna, but don't find the graphs useful for explaining this nt.stat dilemma. Let me know if I'm misunderstanding how to take advantage of the information from the graph. — matsuo_basho, Nov 15 '22 at 21:10
The difference in the new benchmark is relatively small, and this is typically what we can expect to see between two machines with different architectures and different operating systems. The biggest point that may also explain the difference in the first benchmark is the version of Pandas, NumPy and Python: are they the same? Versions matter a lot for performance, as new versions of CPython are significantly faster (due to new optimizations) and new package versions sometimes include optimizations but also regressions. I advise you to test with the same package and CPython versions. — Jérôme Richard, Nov 19 '22 at 18:40
Besides this, the provided code is very inefficient. You should really not iterate over a dataframe, especially not like that. Each call to `df_.iloc[i, :].items()` takes 90 µs on my machine. That is certainly far more than all the iterations of the inner loop. Not to mention the two `pd.DataFrame` constructions, which are also pretty slow: >120 µs for the two. I think your code spends >99% of its time in Pandas overhead. The rest is mainly lost in CPython. Overheads can change a lot between versions. Developers do not optimize this since it is considered bad practice. Consider using vectorized functions. — Jérôme Richard, Nov 19 '22 at 18:49
Yes, I considered vectorizing over the rows. I cannot vectorize over the columns (inner loop) because the calc of a given column depends on the outcome of the previous one. For the outer loop, the problem is that I also need to vectorize both over the df and vars simultaneously, and pandas doesn't have the equivalent of mapply2. I use df_.iloc[i, :].items() to preserve the indices, but yes, I can probably assign the indices to the df at the very end. — matsuo_basho, Nov 19 '22 at 18:53
Regarding the difference in the new benchmark to be 'relatively small', this isn't really an acceptable difference (30%) given that my machine is 1 year old with Intel's latest chip and colleague's machine is 4 years old. We are working in an environment where the packages are the same, but will double-check. — matsuo_basho, Nov 19 '22 at 23:30
Upon review, I don't think I can avoid using df_.iloc[i, :].items() because, in the actual implementation, there is an if-clause in the loop that applies different operations depending on whether the loop has passed a particular column name. — matsuo_basho, Nov 20 '22 at 02:45
