Using Python 3.7.6 and Pandas 1.0.1, on an Intel 8700 CPU, in JupyterLab. This code

import time
import base64
import numpy as np
import pandas as pd
# read in the data
words = pd.read_table("BigFile.txt", names=['word'], encoding="utf-8", header=None)
# create an empty DataFrame with a column to hold the encoded data
dfWords = pd.DataFrame(columns=['sb64'])
start = time.perf_counter()
# encode each word and append it to the DataFrame, one row at a time
for x in range(10000):
    dfWords = dfWords.append({'sb64': base64.b64encode(words.word[x].encode('UTF-8')).decode()}, ignore_index=True)
end = time.perf_counter() 
print(end - start, "seconds")
print("Dataframe Contents ", dfWords, sep='\n')

produces this output

12.090909591002855 seconds
Dataframe Contents 
                  sb64
0     ZGlmZmVyZW5jZQ==
1             d2hlcmU=
2                 bWM=
3                 aXM=
4                 dGhl
...                ...
9995      ZGl2ZXJ0b3I=
9996              aW4=
9997              d2Fz
9998              YXQ=
9999              dGhl

[10000 rows x 1 columns]

My data file contains 10,000,000 lines, so it is 1,000x larger than this little test, which took 12 seconds.
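For comparison, here is a minimal sketch of the same encoding done without the per-row append, using the variable names from the snippet above. DataFrame.append re-copies the whole frame on every call, so building the column in one pass should scale far better:

import base64
import pandas as pd

# assuming `words` is the DataFrame read in above; build the column in a
# single pass with a list comprehension instead of DataFrame.append
encoded = [base64.b64encode(w.encode('utf-8')).decode('ascii')
           for w in words.word[:10000]]
dfWords = pd.DataFrame({'sb64': encoded})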

I tried to do this:

import time
import base64
import pandas as pd

# `words` is the DataFrame read in above
dfWords = pd.DataFrame(columns=['sb64'])
start = time.perf_counter()
dfWords['sb64'] = words.word.str.encode('utf-8', 'strict').str.encode('base64').str.decode('utf-8', 'strict')
end = time.perf_counter()
print(end - start, "seconds")
print(dfWords.tail())

but it failed with this error:

TypeError: Cannot use .str.encode with values of inferred dtype 'bytes'.

However, if I downgrade Pandas to 0.23, it works, and I can encode a million entries in about 4 seconds.

3.5494491000026756 seconds
                      sb64
999995  ZGlzdHJpYnV0aW9u\n
999996              aW4=\n
999997      c2NlbmFyaW8=\n
999998          bGVzcw==\n
999999          bGFuZA==\n

So my full 10-million-line file would take about 40 seconds. Do I have to learn C, plus all that tooling, to go faster? Or is there a better Python way?
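One hedged sketch of a pandas 1.x-compatible version: the 'base64' string-codec route used above no longer works, but the same column-at-a-time idea can be expressed with apply and the standard base64 module (assumes `words` is loaded as in the first snippet; not timed here):

import time
import base64
import pandas as pd

# assuming `words` is loaded as above; encode the whole column at once
# via apply rather than the 'base64' codec removed from newer pandas
start = time.perf_counter()
dfWords = pd.DataFrame({'sb64': words.word.apply(
    lambda s: base64.b64encode(s.encode('utf-8')).decode('ascii'))})
end = time.perf_counter()
print(end - start, "seconds")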

aquagremlin
  • is this a time-sensitive operation? – gold_cy Feb 17 '20 at 22:31
  • "Time sensitive" is relative. But it is part of a larger data science project, and I can see even larger datasets being used down the road. This post addresses my size concerns as well: https://stackoverflow.com/questions/23569771/maximum-size-of-pandas-dataframe – aquagremlin Feb 17 '20 at 22:39
  • the two approaches are also different; I am not surprised that looping and appending to a dataframe is significantly slower than using the optimized `str` methods on a column directly – gold_cy Feb 17 '20 at 22:45 (see the streaming sketch after these comments)
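As the comments suggest, the append loop, not base64 itself, is the likely bottleneck. For a 10-million-line file it may also be worth skipping the DataFrame entirely; a minimal pure-Python streaming sketch (the output file name is hypothetical):

import base64

# stream the file line by line and write the encoded words straight to
# disk, so the 10M encoded rows are never held in memory at once
with open("BigFile.txt", encoding="utf-8") as src, \
        open("BigFile.b64.txt", "w") as dst:  # hypothetical output path
    for line in src:
        word = line.rstrip("\n")
        dst.write(base64.b64encode(word.encode("utf-8")).decode("ascii") + "\n")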
