1

I have a large Pandas dataframe, 24'000'000 rows × 6 columns plus index. I need to read an integer in column 1 (which is = 1 or 2), then force the value in column 3 to be negative if column 1 = 1, or positive if = 2. I use the following code in Jupyter notebook:

for i in range(1000):
    if df.iloc[i,1] == 1:
        df.iloc[i,3] = abs(df.iloc[i,3])*(-1)
    if df.iloc[i,1] == 2:
        df.iloc[i,3] = abs(df.iloc[i,3])

The code above takes 2min 30sec to run for 1'000 rows only. For the 24M rows, it would take 41 days to complete !

Something is not right. The code runs in Jupyter Notebook/Chrome/Windows on a pretty high end PC.

The Pandas dataframe is created with pd.read_csv and then sorted and indexed this way:

df.sort_values(by = "My_time_stamp", ascending=True,inplace = True)
df = df.reset_index(drop=True)

The creation and sorting of the dataframe just takes a few seconds. I have other calculations to perform on this dataframe, so I clearly need to understand what I'm doing wrong.

Hugues
  • 197
  • 1
  • 5
  • 19

4 Answers4

3

np.where

a = np.where(df.iloc[:, 1].to_numpy() == 1, -1, 1)
b = np.abs(df.iloc[:, 3].to_numpy())
df.iloc[:, 3] = a * b
piRSquared
  • 285,575
  • 57
  • 475
  • 624
  • When I try this with a DataFrame, I get the error `'Series' object has no attribute 'to_numpy'`. Would you be willing to show your code with a data example? – Ruthger Righart Jan 09 '20 at 21:02
  • you are using an older version of pandas. Instead user `df.iloc[:, 1].values` and `df.iloc[:, 3].values` – piRSquared Jan 09 '20 at 21:08
  • Also had the time to test this proposal, it works and is actually the fastest, runs in around 0.51 seconds.Thanks / Edit: it is the fastest but df.iloc[:, 3] always comes out negative. Something's not right. – Hugues Jan 10 '20 at 20:08
  • ok, i found the problem, your condition in your where function always return True, so -1. I changed to : a = np.where(df.iloc[:, 1].to_numpy()==1, -1, 1) and it works, and is the fastest. – Hugues Jan 10 '20 at 20:30
2

Vectorize it:

df.iloc[:, 3] = df.iloc[:, 3].abs() * (2 * (df.iloc[:, 1] != 1) - 1)

Explanation:

Treated as int, boolean series df.iloc[:, 1] != 1 gets converted to ones and zeroes. Multiplied by 2, it gets twos and zeroes. After subtracting one, it gets -1 where the first column is 1, and 1 otherwise. Finally, it is multiplied by the absolute value of the third column, which enforces the sign.

Vectorization typically provides an order of magnitude or two speedup comparing to for loops.

ansev
  • 30,322
  • 5
  • 17
  • 31
Marat
  • 15,215
  • 2
  • 39
  • 48
  • It actually runs slower. 1 min for 100 records, so I assume 10 min for 1'000. So to be sure, I have replaced my 2 IF blocks (4 lines above) with your single line. – Hugues Jan 09 '20 at 20:04
  • @Hugues no need to loop for with this method, your 5 lines of code can be replaced – Ben.T Jan 09 '20 at 20:07
  • 1
    Sorry, my bad, I see it now, indeed, with Marat's line alone it went through the whole 24M rows in about a second ! Talk about improvement, I love my PC again. I need to research vectorization to better understand it as I will really need it for the rest. – Hugues Jan 09 '20 at 20:15
  • This proposal works and is the second fastest at 0.57 seconds – Hugues Jan 10 '20 at 20:33
1

Use

df.iloc[:,3] = df.iloc[:,3].abs().mul( df.iloc[:,-1].map({2:1,1:-1}) )
ansev
  • 30,322
  • 5
  • 17
  • 31
  • 1
    returns: AttributeError: 'DataFrame' object has no attribute 'map' WHich I have researched a bit and found this but not that clear to me: https://stackoverflow.com/questions/39535447/attributeerror-dataframe-object-has-no-attribute-map – Hugues Jan 09 '20 at 20:23
  • I like it, `.map()` doesn't get enough love. – AMC Jan 09 '20 at 23:01
  • ok, i had time to test this proposal tonight, the line above works if you correct the typo, it's df.iloc[:,1], not df.iloc[:,-1]. It runs in about 0.88 second for 24M rows, whereas Marat's proposal runs faster in around 0.58 second. Does not make a bit difference for now, but my dataframe is growing at a rate of 1M rows per day. But the .map instruction is easier to grasp for my bird brain. Thanks. – Hugues Jan 10 '20 at 20:04
0

Another way to do this:

import pandas as pd

Take an example data set:

df = pd.DataFrame({'x1':[1,2,1,2], 'x2':[4,8,1,2]})

Make new column, code values as -1 and +1:

df['nx1'] = df['x1'].replace({1:-1, 2:1})

Multiply columnwise:

df['nx1'] * df['x2']
Ruthger Righart
  • 4,799
  • 2
  • 28
  • 33