
I have a CSV file with 100 million rows and I'm using a PC with 14 GB of RAM. I have split the file into two parts of 50 million rows each. I have been waiting for two days for the script to finish executing this code:

df['Column1'] = df['Column1'].apply('{:0>7}'.format)
for index in df.index:
    if df.loc[index, 'Column2'] == 0.0 and df.loc[index, 'Column3'] == 0:
        df.loc[index, 'Column4'] = df.loc[index, 'Column1'][:6]
    else:
        df.loc[index, 'Column4'] = 'F'

If there were a method to simplify that code, would that change its execution time?

.   Column1   Column2   Column3   Column4
0   5487964       1.0       2.0        F
1   5587694       0.0         0   558769
2   7934852       1.0         0        F
3   5487964       0.0       2.0        F
4   1111111       0.0         0   111111
5   5487964       1.0       2.0        F
Moshe

3 Answers


The answer is yes: the way you process the data has a strong impact on how long it takes to process.

For example, code that operates on whole Series runs much faster than code that iterates over rows. You can improve your code like this:

import numpy as np
df = df.assign(Column4=lambda x: np.where((x['Column2'] == 0.0) & (x['Column3'] == 0.0), x['Column1'].str.slice(stop=6), 'F'))

This is faster because it works on whole Series rather than on each row individually.
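As an aside (not part of the original answer), the zero-padding step from the question can be vectorized in the same spirit. A minimal sketch, assuming Column1 holds non-negative values:

# Vectorized replacement for df['Column1'].apply('{:0>7}'.format):
# cast to string once, then left-pad with zeros to width 7 on the whole Series.
df['Column1'] = df['Column1'].astype(str).str.zfill(7)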

Moreover, you can use multi-threading and the tqdm library to process the data faster and see the progress. For further information, have a look at this post.
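For illustration only (not the answer author's code): the answer mentions multi-threading, but for CPU-bound pandas work a process pool usually helps more, so this sketch uses ProcessPoolExecutor with a tqdm progress bar. It assumes df is already loaded as in the question; the helper name process_chunk and the chunk size are assumptions.

import numpy as np
import pandas as pd
from concurrent.futures import ProcessPoolExecutor
from tqdm import tqdm

def process_chunk(chunk):
    # Hypothetical helper: vectorized version of the question's row loop.
    chunk = chunk.copy()
    chunk['Column4'] = np.where(
        (chunk['Column2'] == 0.0) & (chunk['Column3'] == 0),
        chunk['Column1'].astype(str).str.zfill(7).str[:6],
        'F',
    )
    return chunk

if __name__ == '__main__':
    chunk_size = 1_000_000  # assumption: tune so each chunk fits in memory
    chunks = [df.iloc[i:i + chunk_size] for i in range(0, len(df), chunk_size)]
    with ProcessPoolExecutor() as executor:
        results = list(tqdm(executor.map(process_chunk, chunks), total=len(chunks)))
    df = pd.concat(results)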

Mehdi Golzadeh

In place of pandas, try using NumPy: you are processing a lot of data, so even a small improvement helps here. Iterating over NumPy arrays is less time-consuming than iterating over a pandas DataFrame. Try it and let us know whether this works for you; then we can look for more bottlenecks.
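To illustrate the suggestion, here is a minimal sketch, assuming df is already loaded and Column1 can be cast to string (this is not the answer author's code):

import numpy as np

# Pull the columns out as plain NumPy arrays once, then operate on them.
col1 = df['Column1'].astype(str).to_numpy()
col2 = df['Column2'].to_numpy()
col3 = df['Column3'].to_numpy()

padded = np.char.zfill(col1, 7)   # zero-pad to width 7
first6 = padded.astype('<U6')     # casting to a shorter fixed-width dtype truncates to 6 chars
mask = (col2 == 0.0) & (col3 == 0)
df['Column4'] = np.where(mask, first6, 'F')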


This can make things faster. Maybe you can ensure Column1 is read as a string when you read the DataFrame.

import numpy as np
df['C5'] = np.where((df['Column2'] == 0.0) & (df['Column3'] == 0), df['Column1'].str[:6], 'F')
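For reference, a minimal sketch of reading Column1 as a string up front (the file name data.csv is an assumption):

import pandas as pd

# Assumption: the file is named 'data.csv'; passing dtype makes pandas read
# Column1 as strings, so .str[:6] slices characters rather than rows.
df = pd.read_csv('data.csv', dtype={'Column1': str})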