I have a large Pandas dataframe: 24'000'000 rows × 6 columns, plus the index. For each row I need to read the integer in column 1 (always either 1 or 2), then force the value in column 3 to be negative if column 1 is 1, or positive if it is 2. I use the following code in a Jupyter notebook:
for i in range(1000):
    if df.iloc[i,1] == 1:
        df.iloc[i,3] = abs(df.iloc[i,3])*(-1)
    if df.iloc[i,1] == 2:
        df.iloc[i,3] = abs(df.iloc[i,3])
The code above takes 2 min 30 s to run for only 1'000 rows. At that rate (0.15 s per row), the full 24M rows would take about 41 days to complete!
Something is not right. The code runs in Jupyter Notebook (Chrome on Windows) on a fairly high-end PC.
The Pandas dataframe is created with pd.read_csv and then sorted and indexed this way:
df.sort_values(by="My_time_stamp", ascending=True, inplace=True)
df = df.reset_index(drop=True)
Creating and sorting the dataframe takes just a few seconds. I have other calculations to perform on this dataframe, so I clearly need to understand what I'm doing wrong.
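For comparison, the same sign rule can be sketched as a single whole-column operation instead of a row-by-row loop. This is only a minimal illustration: the small dataframe and its column names (`flag`, `value`, `c2`, ...) are made up, and only the positions (column 1 holds the 1-or-2 integer, column 3 holds the value to re-sign) come from the description above. It also assumes the column labels are unique.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data: six columns, with the 1-or-2
# integer at position 1 and the value to re-sign at position 3.
df = pd.DataFrame({
    "My_time_stamp": [10, 20, 30, 40],
    "flag": [1, 2, 1, 2],
    "c2": [0, 0, 0, 0],
    "value": [5.0, -3.0, -7.0, 2.0],
    "c4": [0, 0, 0, 0],
    "c5": [0, 0, 0, 0],
})

# Pull both columns out once as NumPy arrays.
flag = df.iloc[:, 1].to_numpy()
value = df.iloc[:, 3].to_numpy()
magnitude = np.abs(value)

# flag == 1 -> negative, flag == 2 -> positive, any other flag -> unchanged
# (the same behaviour as the two `if` tests in the loop).
new_value = np.where(flag == 1, -magnitude,
                     np.where(flag == 2, magnitude, value))

# Write the whole column back in one assignment (assumes unique column labels).
df[df.columns[3]] = new_value

print(df.iloc[:, 3].tolist())
```

The point of the sketch is that each column is read and written once as an array, whereas the loop performs several separate `df.iloc[i, j]` lookups and assignments for every row.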