Using Python 3.7.6 and Pandas 1.0.1, on an Intel 8700 CPU, in JupyterLab. This code
import time
import base64
import numpy as np
import pandas as pd
#read in data
words = pd.read_table("BigFile.txt",names=['word'],encoding="utf-8",header=None)
#add a column to hold encoded data
dfWords = pd.DataFrame(columns=['sb64'])
start = time.perf_counter()
#encode data and add it to column
for x in range(10000):
    dfWords = dfWords.append({'sb64': base64.b64encode(words.word[x].encode('UTF-8')).decode()}, ignore_index=True)
end = time.perf_counter()
print(end - start, "seconds")
print("Dataframe Contents ", dfWords, sep='\n')
produces this output
12.090909591002855 seconds
Dataframe Contents
sb64
0 ZGlmZmVyZW5jZQ==
1 d2hlcmU=
2 bWM=
3 aXM=
4 dGhl
... ...
9995 ZGl2ZXJ0b3I=
9996 aW4=
9997 d2Fz
9998 YXQ=
9999 dGhl
[10000 rows x 1 columns]
My data file contains 10,000,000 lines, 1000x more than this little test, so at 12 seconds per 10,000 rows the full file would take on the order of 12,000 seconds (over three hours).
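I suspect most of those 12 seconds go into DataFrame.append, which builds a brand-new frame on every call, rather than into the base64 encoding itself. As a baseline, a sketch like this (only sanity-checked, not timed on the full file) builds the column in a single assignment with a plain list comprehension:
import time
import base64
import pandas as pd
words = pd.read_table("BigFile.txt", names=['word'], encoding="utf-8", header=None)
start = time.perf_counter()
# build the whole column at once instead of appending one row at a time
dfWords = pd.DataFrame({'sb64': [base64.b64encode(w.encode('utf-8')).decode()
                                 for w in words.word]})
end = time.perf_counter()
print(end - start, "seconds")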
I also tried Pandas' vectorized string methods:
import time
import base64
dfWords = pd.DataFrame(columns=['sb64'])
start = time.perf_counter()
dfWords['sb64'] = words.word.str.encode('utf-8', 'strict').str.encode('base64').str.decode('utf-8', 'strict')
end = time.perf_counter()
print(end - start, "seconds")
print(dfWords.tail())
but it failed with this error:
TypeError: Cannot use .str.encode with values of inferred dtype 'bytes'.
However, if I downgrade Pandas to 0.23 it works, and I can encode a million entries in about 4 seconds:
3.5494491000026756 seconds
sb64
999995 ZGlzdHJpYnV0aW9u\n
999996 aW4=\n
999997 c2NlbmFyaW8=\n
999998 bGVzcw==\n
999999 bGFuZA==\n
So my full 10-million-line file would take about 40 seconds. Do I have to learn C, and all the tooling that goes with it, to go faster? Or is there a better Python way?
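For what it's worth, here is the kind of Pandas 1.x-compatible variant I assume would sidestep the TypeError (only .str.encode is blocked for a bytes column; .str.decode still works), though I haven't verified that it matches the 0.23 timings:
import time
import base64
import pandas as pd
words = pd.read_table("BigFile.txt", names=['word'], encoding="utf-8", header=None)
dfWords = pd.DataFrame()
start = time.perf_counter()
# UTF-8 encode to bytes, base64-encode each element, then decode back to str
dfWords['sb64'] = (words.word.str.encode('utf-8')
                             .map(base64.b64encode)
                             .str.decode('ascii'))
end = time.perf_counter()
print(end - start, "seconds")
print(dfWords.tail())
Note that base64.b64encode does not append the trailing newline that the old 'base64' codec did, so the \n suffix shown in the 0.23 output above would disappear.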