
I have a CSV file with 100 million rows and I'm using a PC with 14 GB of RAM. I have split the file into two parts of 50 million rows each. I have been waiting for two days for the script to finish executing this code:

df['Column1'] = df['Column1'].apply('{:0>7}'.format)
for index in df.index:
    if df.loc[index, 'Column2'] == 0.0 and df.loc[index, 'Column3'] == 0:
        df.loc[index, 'Column4'] = df.loc[index, 'Column1'][:6]
    else:
        df.loc[index, 'Column4'] = 'F'

If there were a method to simplify that code, would that change its execution time?

.   Column1   Column2   Column3   Column4
0   5487964       1.0       2.0        F
1   5587694       0.0         0   558769
2   7934852       1.0         0        F
3   5487964       0.0       2.0        F
4   1111111       0.0         0   111111
5   5487964       1.0       2.0        F
Moshe

3 Answers


The answer is yes: the way you process the data has a strong impact on how long it takes to process.

For example, code that operates on whole Series runs much faster than code that iterates over rows. You can improve your code like this:

import numpy as np
df = df.assign(Column4=lambda x: np.where((x['Column2'] == 0.0) & (x['Column3'] == 0.0), x['Column1'].str.slice(stop=6), 'F'))

This is faster because it works on whole Series rather than on each row individually.
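As an aside (not part of the original answer), the zero-padding step from the question can be vectorized in the same spirit. A minimal sketch, assuming Column1 holds non-negative values:

# Vectorized replacement for df['Column1'].apply('{:0>7}'.format):
# cast to string once, then left-pad with zeros to width 7 on the whole Series.
df['Column1'] = df['Column1'].astype(str).str.zfill(7)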

Moreover, you can use multi-threading and the tqdm library to process the data faster and see the progress. For further information, have a look at this post.
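For illustration only (not the answer author's code): the answer mentions multi-threading, but for CPU-bound pandas work a process pool usually helps more, so this sketch uses ProcessPoolExecutor with a tqdm progress bar. It assumes df is already loaded as in the question; the helper name process_chunk and the chunk size are assumptions.

import numpy as np
import pandas as pd
from concurrent.futures import ProcessPoolExecutor
from tqdm import tqdm

def process_chunk(chunk):
    # Hypothetical helper: vectorized version of the question's row loop.
    chunk = chunk.copy()
    chunk['Column4'] = np.where(
        (chunk['Column2'] == 0.0) & (chunk['Column3'] == 0),
        chunk['Column1'].astype(str).str.zfill(7).str[:6],
        'F',
    )
    return chunk

if __name__ == '__main__':
    chunk_size = 1_000_000  # assumption: tune so each chunk fits in memory
    chunks = [df.iloc[i:i + chunk_size] for i in range(0, len(df), chunk_size)]
    with ProcessPoolExecutor() as executor:
        results = list(tqdm(executor.map(process_chunk, chunks), total=len(chunks)))
    df = pd.concat(results)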

Mehdi Golzadeh

In place of pandas, try using NumPy: you are processing a lot of data, so even a small improvement helps here. Iterating over NumPy arrays is less time-consuming than iterating over a pandas DataFrame. Try it and let us know whether this works for you; then we can look for more bottlenecks.
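To illustrate the suggestion, here is a minimal sketch, assuming df is already loaded and Column1 can be cast to string (this is not the answer author's code):

import numpy as np

# Pull the columns out as plain NumPy arrays once, then operate on them.
col1 = df['Column1'].astype(str).to_numpy()
col2 = df['Column2'].to_numpy()
col3 = df['Column3'].to_numpy()

padded = np.char.zfill(col1, 7)   # zero-pad to width 7
first6 = padded.astype('<U6')     # casting to a shorter fixed-width dtype truncates to 6 chars
mask = (col2 == 0.0) & (col3 == 0)
df['Column4'] = np.where(mask, first6, 'F')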


This can make things faster. Maybe you can ensure Column1 is read as a string when you read the DataFrame.

import numpy as np
df['C5'] = np.where((df['Column2'] == 0.0) & (df['Column3'] == 0), df['Column1'].str[:6], 'F')
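For reference, a minimal sketch of reading Column1 as a string up front (the file name data.csv is an assumption):

import pandas as pd

# Assumption: the file is named 'data.csv'; passing dtype makes pandas read
# Column1 as strings, so .str[:6] slices characters rather than rows.
df = pd.read_csv('data.csv', dtype={'Column1': str})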