How to speed up this task in Python

Question

I have a large Pandas dataframe, 24'000'000 rows × 6 columns plus index. I need to read an integer in column 1 (which is = 1 or 2), then force the value in column 3 to be negative if column 1 = 1, or positive if = 2. I use the following code in Jupyter notebook:

for i in range(1000):
    if df.iloc[i,1] == 1:
        df.iloc[i,3] = abs(df.iloc[i,3])*(-1)
    if df.iloc[i,1] == 2:
        df.iloc[i,3] = abs(df.iloc[i,3])

The code above takes 2min 30sec to run for 1'000 rows only. For the 24M rows, it would take 41 days to complete !

Something is not right. The code runs in Jupyter Notebook/Chrome/Windows on a pretty high end PC.

The Pandas dataframe is created with pd.read_csv and then sorted and indexed this way:

df.sort_values(by = "My_time_stamp", ascending=True,inplace = True)
df = df.reset_index(drop=True)

The creation and sorting of the dataframe just takes a few seconds. I have other calculations to perform on this dataframe, so I clearly need to understand what I'm doing wrong.

You are doing exactly what many pandanistas would suggest you not do. Don't use for loops to do math. — piRSquared, Jan 09 '20 at 19:54

piRSquared · Accepted Answer · 2020-01-10T20:23:34.203

3

`np.where`

a = np.where(df.iloc[:, 1].to_numpy() == 1, -1, 1)
b = np.abs(df.iloc[:, 3].to_numpy())
df.iloc[:, 3] = a * b

edited Jan 10 '20 at 20:23

answered Jan 09 '20 at 19:57

piRSquared

285,575
57
475
624

When I try this with a DataFrame, I get the error `'Series' object has no attribute 'to_numpy'`. Would you be willing to show your code with a data example? – Ruthger Righart Jan 09 '20 at 21:02
you are using an older version of pandas. Instead user `df.iloc[:, 1].values` and `df.iloc[:, 3].values` – piRSquared Jan 09 '20 at 21:08
Also had the time to test this proposal, it works and is actually the fastest, runs in around 0.51 seconds.Thanks / Edit: it is the fastest but df.iloc[:, 3] always comes out negative. Something's not right. – Hugues Jan 10 '20 at 20:08
ok, i found the problem, your condition in your where function always return True, so -1. I changed to : a = np.where(df.iloc[:, 1].to_numpy()==1, -1, 1) and it works, and is the fastest. – Hugues Jan 10 '20 at 20:30

score 2 · Answer 2 · edited Jan 09 '20 at 21:17

2

Vectorize it:

df.iloc[:, 3] = df.iloc[:, 3].abs() * (2 * (df.iloc[:, 1] != 1) - 1)

Explanation:

Treated as int, boolean series df.iloc[:, 1] != 1 gets converted to ones and zeroes. Multiplied by 2, it gets twos and zeroes. After subtracting one, it gets -1 where the first column is 1, and 1 otherwise. Finally, it is multiplied by the absolute value of the third column, which enforces the sign.

Vectorization typically provides an order of magnitude or two speedup comparing to for loops.

edited Jan 09 '20 at 21:17

ansev

30,322
5
17
31

answered Jan 09 '20 at 19:54

Marat

15,215
2
39
48

It actually runs slower. 1 min for 100 records, so I assume 10 min for 1'000. So to be sure, I have replaced my 2 IF blocks (4 lines above) with your single line. – Hugues Jan 09 '20 at 20:04
@Hugues no need to loop for with this method, your 5 lines of code can be replaced – Ben.T Jan 09 '20 at 20:07
1

Sorry, my bad, I see it now, indeed, with Marat's line alone it went through the whole 24M rows in about a second ! Talk about improvement, I love my PC again. I need to research vectorization to better understand it as I will really need it for the rest. – Hugues Jan 09 '20 at 20:15
This proposal works and is the second fastest at 0.57 seconds – Hugues Jan 10 '20 at 20:33

ansev · Answer 3 · 2020-01-09T20:30:12.300

1

Use

df.iloc[:,3] = df.iloc[:,3].abs().mul( df.iloc[:,-1].map({2:1,1:-1}) )

edited Jan 09 '20 at 20:30

answered Jan 09 '20 at 19:58

ansev

30,322
5
17
31

1

returns: AttributeError: 'DataFrame' object has no attribute 'map' WHich I have researched a bit and found this but not that clear to me: https://stackoverflow.com/questions/39535447/attributeerror-dataframe-object-has-no-attribute-map – Hugues Jan 09 '20 at 20:23
I like it, `.map()` doesn't get enough love. – AMC Jan 09 '20 at 23:01
ok, i had time to test this proposal tonight, the line above works if you correct the typo, it's df.iloc[:,1], not df.iloc[:,-1]. It runs in about 0.88 second for 24M rows, whereas Marat's proposal runs faster in around 0.58 second. Does not make a bit difference for now, but my dataframe is growing at a rate of 1M rows per day. But the .map instruction is easier to grasp for my bird brain. Thanks. – Hugues Jan 10 '20 at 20:04

score 0 · Answer 4 · answered Jan 09 '20 at 20:04

0

Another way to do this:

import pandas as pd

Take an example data set:

df = pd.DataFrame({'x1':[1,2,1,2], 'x2':[4,8,1,2]})

Make new column, code values as -1 and +1:

df['nx1'] = df['x1'].replace({1:-1, 2:1})

Multiply columnwise:

df['nx1'] * df['x2']

answered Jan 09 '20 at 20:04

Ruthger Righart

4,799
2
28
33

How to speed up this task in Python

4 Answers4

np.where

`np.where`