Using Python 3.7.6 and Pandas 1.0.1, on an Intel 8700 CPU, in JupyterLab. This code

import time
import base64
import numpy as np
import pandas as pd
# read in the data
words = pd.read_table("BigFile.txt", names=['word'], encoding="utf-8", header=None)
# create an empty DataFrame with a column to hold the encoded data
dfWords = pd.DataFrame(columns=['sb64'])
start = time.perf_counter()
# encode each word and append it to the DataFrame, one row at a time
for x in range(10000):
    dfWords = dfWords.append({'sb64': base64.b64encode(words.word[x].encode('UTF-8')).decode()}, ignore_index=True)
end = time.perf_counter() 
print(end - start, "seconds")
print("Dataframe Contents ", dfWords, sep='\n')

produces this output

12.090909591002855 seconds
Dataframe Contents 
                  sb64
0     ZGlmZmVyZW5jZQ==
1             d2hlcmU=
2                 bWM=
3                 aXM=
4                 dGhl
...                ...
9995      ZGl2ZXJ0b3I=
9996              aW4=
9997              d2Fz
9998              YXQ=
9999              dGhl

[10000 rows x 1 columns]

My data file contains 10,000,000 lines, so it is 1,000x larger than this little test, which took 12 seconds.
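For comparison, here is a minimal sketch of the same encoding done without the per-row append, using the variable names from the snippet above. DataFrame.append re-copies the whole frame on every call, so building the column in one pass should scale far better:

import base64
import pandas as pd

# assuming `words` is the DataFrame read in above; build the column in a
# single pass with a list comprehension instead of DataFrame.append
encoded = [base64.b64encode(w.encode('utf-8')).decode('ascii')
           for w in words.word[:10000]]
dfWords = pd.DataFrame({'sb64': encoded})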

I tried to do this:

import time
import base64
import pandas as pd

# `words` is the DataFrame read in above
dfWords = pd.DataFrame(columns=['sb64'])
start = time.perf_counter()
dfWords['sb64'] = words.word.str.encode('utf-8', 'strict').str.encode('base64').str.decode('utf-8', 'strict')
end = time.perf_counter()
print(end - start, "seconds")
print(dfWords.tail())

but it failed with this error:

TypeError: Cannot use .str.encode with values of inferred dtype 'bytes'.

However, if I downgrade Pandas to 0.23, it works, and I can encode a million entries in about 4 seconds.

3.5494491000026756 seconds
                      sb64
999995  ZGlzdHJpYnV0aW9u\n
999996              aW4=\n
999997      c2NlbmFyaW8=\n
999998          bGVzcw==\n
999999          bGFuZA==\n

So my full 10-million-line file would take about 40 seconds. Do I have to learn C, plus all that tooling, to go faster? Or is there a better Python way?
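One hedged sketch of a pandas 1.x-compatible version: the 'base64' string-codec route used above no longer works, but the same column-at-a-time idea can be expressed with apply and the standard base64 module (assumes `words` is loaded as in the first snippet; not timed here):

import time
import base64
import pandas as pd

# assuming `words` is loaded as above; encode the whole column at once
# via apply rather than the 'base64' codec removed from newer pandas
start = time.perf_counter()
dfWords = pd.DataFrame({'sb64': words.word.apply(
    lambda s: base64.b64encode(s.encode('utf-8')).decode('ascii'))})
end = time.perf_counter()
print(end - start, "seconds")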

aquagremlin
  • is this a time-sensitive operation? – gold_cy Feb 17 '20 at 22:31
  • "Time sensitive" is relative. But it is part of a larger data science project, and I can see even larger datasets being used down the road. This post addresses my size concerns as well: https://stackoverflow.com/questions/23569771/maximum-size-of-pandas-dataframe – aquagremlin Feb 17 '20 at 22:39
  • the two approaches are also different; I am not surprised that looping and appending to a dataframe is significantly slower than using the optimized `str` methods on a column directly – gold_cy Feb 17 '20 at 22:45 (see the streaming sketch after these comments)
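As the comments suggest, the append loop, not base64 itself, is the likely bottleneck. For a 10-million-line file it may also be worth skipping the DataFrame entirely; a minimal pure-Python streaming sketch (the output file name is hypothetical):

import base64

# stream the file line by line and write the encoded words straight to
# disk, so the 10M encoded rows are never held in memory at once
with open("BigFile.txt", encoding="utf-8") as src, \
        open("BigFile.b64.txt", "w") as dst:  # hypothetical output path
    for line in src:
        word = line.rstrip("\n")
        dst.write(base64.b64encode(word.encode("utf-8")).decode("ascii") + "\n")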
