How to make this for loop faster?

Question

I know that python loops themselves are relatively slow when compared to other languages but when the correct functions are used they become much faster. I have a pandas dataframe called "acoustics" which contains over 10 million rows:

print(acoustics)
                        timestamp            c0  rowIndex
0        2016-01-01T00:00:12.000Z  13931.500000   8158791
1        2016-01-01T00:00:30.000Z  14084.099609   8158792
2        2016-01-01T00:00:48.000Z  13603.400391   8158793
3        2016-01-01T00:01:06.000Z  13977.299805   8158794
4        2016-01-01T00:01:24.000Z  13611.000000   8158795
5        2016-01-01T00:02:18.000Z  13695.000000   8158796
6        2016-01-01T00:02:36.000Z  13809.400391   8158797
7        2016-01-01T00:02:54.000Z  13756.000000   8158798

and there is the code I wrote:

acoustics = pd.read_csv("AccousticSandDetector.csv", skiprows=[1])
weights = [1/9, 1/18, 1/27, 1/36, 1/54]
sumWeights = np.sum(weights)
deltaAc = []
for i in range(5, len(acoustics)):
    time = acoustics.iloc[i]['timestamp']
    sum = 0
    for c in range(5):
        sum += (weights[c]/sumWeights)*(acoustics.iloc[i]['c0']-acoustics.iloc[i-c]['c0'])
    print("Row " + str(i) + " of " + str(len(acoustics)) + " is iterated")
    deltaAc.append([time, sum])

deltaAc = pd.DataFrame(deltaAc)

It takes a huge amount of time, how can I make it faster?

What version of python are you running? And how long is ```acoustics```? — Chillie, Mar 04 '19 at 15:45
@RehimAlizadeh then i would only print multiples of 100k or something — SuperStew, Mar 04 '19 at 15:46
@Chillie 3.5, `acoustics` contains approximately 10 milion rows — Rehim Alizadeh, Mar 04 '19 at 15:47
You can turn your problem into a full numpy loop with acoustics.values and go from there — Alessandro, Mar 04 '19 at 15:52
Normally when using `pandas` you don't want to use `for` loops. It looks like you are making a new series based on some criteria from the rest of the dataframe, is that right? Here is [an example](https://stackoverflow.com/questions/12376863/adding-calculated-columns-to-a-dataframe-in-pandas) of using `pandas` without having to use for loops. — Jack Moody, Mar 04 '19 at 15:52
@JackMoody yes I am trying to make a new series based on a calculation done on all the values of `acoustics`. How can I make something like in than example for this case? — Rehim Alizadeh, Mar 04 '19 at 15:58
can you provide sample data as well? looking at your code, I'd wrap the inner for loop in a function and use `apply`. also since your data is large, i'd try utilizing `dask` instead of `pandas`. also to track progress, i'd use `tqdm`. all of that are suggestions - you should try to read on them and implement them. once you provide data, others can try to replicate it for you. — Amir Imani, Mar 04 '19 at 15:59
@AmirhosImani I edited the post and added several rows from the beginning of the data. Could you give an example of how to do that? — Rehim Alizadeh, Mar 04 '19 at 16:02

Ben.T · Accepted Answer · 2019-03-05T12:41:59.163

1

You can use diff from pandas and create all the differences for each row in an array, then multiply with your weigths and finally sum over the axis 1, such as:

deltaAc = pd.DataFrame({'timestamp': acoustics.loc[5:, 'timestamp'], 
                       'summation': (np.array([acoustics.c0.diff(i) for i in range(5) ]).T[5:]
                                               *np.array(weights)).sum(1)/sumWeights})

and you get the same values than what I get with your code:

print (deltaAc)
                  timestamp  summation
5  2016-01-01T00:02:18.000Z -41.799986
6  2016-01-01T00:02:36.000Z  51.418728
7  2016-01-01T00:02:54.000Z  -3.111184

edited Mar 05 '19 at 12:41

answered Mar 04 '19 at 17:30

Ben.T

29,160
6
32
54

thanks for your solution, the results are not the same as the code I wrote – Rehim Alizadeh Mar 05 '19 at 07:35
@RehimAlizadeh Indeed, I had a mistypo (`df` instead of `acoustics`) in my code, but otherwse it gives the same values in the column with the sum than your code – Ben.T Mar 05 '19 at 12:41
after a few days I am again stuck with this issue, your solution is super fast and its really handy form me, but now I need to change this calculation a little bit. But I haven't really understood how your solution works therefore I am not able to do this by myself. Can you please explain what exactly each part does? – Rehim Alizadeh Mar 06 '19 at 17:35
1

@RehimAlizadeh so the method `diff` with a value `i` allows to calculate at once all the difference of each row with the value i rows before. The list compresion in `np.array` create an array in which each row represent the difference between the row with the i-th row before from the dataframe. The `T` transposes rows and columns and `[5:]` is to not select the first 5 rows to do the operation equivalent to your code `for i in range(5, len(acoustics))`. The `*np.array(weights)` will multiply each column, being the i-th difference, with the good weight in a vectorize way – Ben.T Mar 07 '19 at 11:26
1

@RehimAlizadeh finally, the `sum` will sum all the values on a same rows, that you can divide by the total of weight to normalize the result. All this code is used to create the values from the column summation and create the result dataframe with the good timestamp. I hope it helps. Let me know if you need more – Ben.T Mar 07 '19 at 11:30
1

Thank you very much for the explanation, it was very helpful! – Rehim Alizadeh Mar 07 '19 at 12:55

PiRK · Answer 2 · 2019-03-04T16:13:17.853

0

First optimization, weights[c]/sumWeights could be done outside the loop.

weights_array = np.array([1/9, 1/18, 1/27, 1/36, 1/54])
sumWeights = np.sum(weights_array)
tmp = weights_array / sumWeights
...
        sum += tmp[c]*...

I'm not familiar with pandas, but if you could extract your columns as 1D numpy array, it would be great for you. It might look something like:

# next lines to be tested, or find the correct way of extracting the column
c0_column = acoustics[['c0']].values
time_column = acoustics[['times']].values
...
sum = numpy.zeros(shape=(len(acoustics)-5,))
delta_ac = []
for c in range(5):
    sum += tmp[c]*(c0_column[5:]-c0_column[5-c:len(acoustics)-c])

for i in range(len(acoustics)-5):
    deltaAc.append([time[5+i], sum[i])

edited Mar 04 '19 at 16:13

answered Mar 04 '19 at 16:07

PiRK

893
3
9
24

This code doesn't exactly do what I wanted, I want to find the sum (from 1 to 5) of (weight multiplied ( given point -previous point(from 1 to 5))) – Rehim Alizadeh Mar 04 '19 at 16:31
1

I'm unable to test the code at the moment, so if anyone wants to build upon my answer to fix what is wrong and get it to produce exactly the same result as the original loop, I wouldn't mind ;) Juggling with indices is always tricky. But the general idea is that for loops can generally be replaced by array operations (with adequate slicing when computing a difference). – PiRK Mar 04 '19 at 16:37

bubble · Answer 3 · 2019-03-04T23:39:26.260

0

Dataframes have a great method rolling for constructing and applying windowing transformations; So, you don't need loops at all:

# df is your data frame
window_size = 5
weights = pd.np.array([1/9, 1/18, 1/27, 1/36, 1/54])
weights /= weights.sum()
df.loc[:,'deltaAc'] = df.loc[:, 'c0'].rolling(window_size).apply(lambda x: ((x[-1] - x)*weights).sum())

edited Mar 04 '19 at 23:39

answered Mar 04 '19 at 23:32

bubble

1,634
12
17

thanks for your solution, it's super fast but the results are not the same – Rehim Alizadeh Mar 05 '19 at 07:34

How to make this for loop faster?

3 Answers3