2

I know that python loops themselves are relatively slow when compared to other languages but when the correct functions are used they become much faster. I have a pandas dataframe called "acoustics" which contains over 10 million rows:

print(acoustics)
                        timestamp            c0  rowIndex
0        2016-01-01T00:00:12.000Z  13931.500000   8158791
1        2016-01-01T00:00:30.000Z  14084.099609   8158792
2        2016-01-01T00:00:48.000Z  13603.400391   8158793
3        2016-01-01T00:01:06.000Z  13977.299805   8158794
4        2016-01-01T00:01:24.000Z  13611.000000   8158795
5        2016-01-01T00:02:18.000Z  13695.000000   8158796
6        2016-01-01T00:02:36.000Z  13809.400391   8158797
7        2016-01-01T00:02:54.000Z  13756.000000   8158798

and there is the code I wrote:

acoustics = pd.read_csv("AccousticSandDetector.csv", skiprows=[1])
weights = [1/9, 1/18, 1/27, 1/36, 1/54]
sumWeights = np.sum(weights)
deltaAc = []
for i in range(5, len(acoustics)):
    time = acoustics.iloc[i]['timestamp']
    sum = 0
    for c in range(5):
        sum += (weights[c]/sumWeights)*(acoustics.iloc[i]['c0']-acoustics.iloc[i-c]['c0'])
    print("Row " + str(i) + " of " + str(len(acoustics)) + " is iterated")
    deltaAc.append([time, sum])

deltaAc = pd.DataFrame(deltaAc)

It takes a huge amount of time, how can I make it faster?

Rehim Alizadeh
  • 319
  • 1
  • 3
  • 12

3 Answers3

1

You can use diff from pandas and create all the differences for each row in an array, then multiply with your weigths and finally sum over the axis 1, such as:

deltaAc = pd.DataFrame({'timestamp': acoustics.loc[5:, 'timestamp'], 
                       'summation': (np.array([acoustics.c0.diff(i) for i in range(5) ]).T[5:]
                                               *np.array(weights)).sum(1)/sumWeights})

and you get the same values than what I get with your code:

print (deltaAc)
                  timestamp  summation
5  2016-01-01T00:02:18.000Z -41.799986
6  2016-01-01T00:02:36.000Z  51.418728
7  2016-01-01T00:02:54.000Z  -3.111184
Ben.T
  • 29,160
  • 6
  • 32
  • 54
  • thanks for your solution, the results are not the same as the code I wrote – Rehim Alizadeh Mar 05 '19 at 07:35
  • @RehimAlizadeh Indeed, I had a mistypo (`df` instead of `acoustics`) in my code, but otherwse it gives the same values in the column with the sum than your code – Ben.T Mar 05 '19 at 12:41
  • after a few days I am again stuck with this issue, your solution is super fast and its really handy form me, but now I need to change this calculation a little bit. But I haven't really understood how your solution works therefore I am not able to do this by myself. Can you please explain what exactly each part does? – Rehim Alizadeh Mar 06 '19 at 17:35
  • 1
    @RehimAlizadeh so the method `diff` with a value `i` allows to calculate at once all the difference of each row with the value i rows before. The list compresion in `np.array` create an array in which each row represent the difference between the row with the i-th row before from the dataframe. The `T` transposes rows and columns and `[5:]` is to not select the first 5 rows to do the operation equivalent to your code `for i in range(5, len(acoustics))`. The `*np.array(weights)` will multiply each column, being the i-th difference, with the good weight in a vectorize way – Ben.T Mar 07 '19 at 11:26
  • 1
    @RehimAlizadeh finally, the `sum` will sum all the values on a same rows, that you can divide by the total of weight to normalize the result. All this code is used to create the values from the column summation and create the result dataframe with the good timestamp. I hope it helps. Let me know if you need more – Ben.T Mar 07 '19 at 11:30
  • 1
    Thank you very much for the explanation, it was very helpful! – Rehim Alizadeh Mar 07 '19 at 12:55
0

First optimization, weights[c]/sumWeights could be done outside the loop.

weights_array = np.array([1/9, 1/18, 1/27, 1/36, 1/54])
sumWeights = np.sum(weights_array)
tmp = weights_array / sumWeights
...
        sum += tmp[c]*...

I'm not familiar with pandas, but if you could extract your columns as 1D numpy array, it would be great for you. It might look something like:

# next lines to be tested, or find the correct way of extracting the column
c0_column = acoustics[['c0']].values
time_column = acoustics[['times']].values
...
sum = numpy.zeros(shape=(len(acoustics)-5,))
delta_ac = []
for c in range(5):
    sum += tmp[c]*(c0_column[5:]-c0_column[5-c:len(acoustics)-c])

for i in range(len(acoustics)-5):
    deltaAc.append([time[5+i], sum[i])
PiRK
  • 893
  • 3
  • 9
  • 24
  • This code doesn't exactly do what I wanted, I want to find the sum (from 1 to 5) of (weight multiplied ( given point -previous point(from 1 to 5))) – Rehim Alizadeh Mar 04 '19 at 16:31
  • 1
    I'm unable to test the code at the moment, so if anyone wants to build upon my answer to fix what is wrong and get it to produce exactly the same result as the original loop, I wouldn't mind ;) Juggling with indices is always tricky. But the general idea is that for loops can generally be replaced by array operations (with adequate slicing when computing a difference). – PiRK Mar 04 '19 at 16:37
0

Dataframes have a great method rolling for constructing and applying windowing transformations; So, you don't need loops at all:

# df is your data frame
window_size = 5
weights = pd.np.array([1/9, 1/18, 1/27, 1/36, 1/54])
weights /= weights.sum()
df.loc[:,'deltaAc'] = df.loc[:, 'c0'].rolling(window_size).apply(lambda x: ((x[-1] - x)*weights).sum())
bubble
  • 1,634
  • 12
  • 17