In a recent project I had to serialize a large scikit-learn model (a trained classifier) within a proprietary tool to persist it in a database. Due to restrictions in this tool, the only way was to get the model into a string (first pickling it, then base64-encoding the result).
Another restriction was that the maximum string length per cell was around 10,000 characters, while the base64 string of the model was about 150,000,000 characters long.
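For context, the encoding step looks roughly like this (a minimal sketch; clf is a placeholder standing in for the fitted classifier, not the actual model from the project):

import base64
import pickle

clf = {"placeholder": "for the fitted scikit-learn estimator"}  # hypothetical stand-in

# serialize to bytes, then to a base64 text string for storage in the tool
blob = base64.b64encode(pickle.dumps(clf)).decode("ascii")

# restoring the object later is just the reverse
restored = pickle.loads(base64.b64decode(blob))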
My first approach using textwrap took multiple hours just for the wrapping, and even resorting to pandas did not help.
In the end I came up with a plain recursive Python function that splits the string roughly in half, which is much faster than both library approaches mentioned before. I have the feeling that I'm comparing apples to oranges, but nonetheless I would be interested in some insight into this.
(full code below)
# plain python
%timeit w1 = wrap([teststr], target=100)
629 µs ± 85 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# textwrap module
%timeit w2 = textwrap.wrap(teststr, 100)
24.3 ms ± 7.82 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
# pandas-based wrap
%timeit w3 = pd.Series(teststr).str.wrap(100).str.split("\n")
36.2 ms ± 1.72 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Now the main difference is that both library approaches split the string into equally sized chunks (until the very last chunk), while the recursive function splits it into chunks of two different lengths, both smaller than the target length (97 and 98 characters for the 100,000-character test string).
My question: Can someone explain why the recursive approach with the string slicing is so much faster than the other two? Is there some implicit looping needed to produce equally sized chunks that makes them so slow? Are string slicing operations (as used in the custom wrap) really that much faster?
Assuming I did need equally sized chunks (which I fortunately did not in this project), what method would give better performance than the two mentioned above?
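For concreteness, the kind of equal-sized chunking I have in mind as a baseline is a plain slicing loop (just a sketch, not benchmarked above):

def chunk(s, size=100):
    # slice the string into consecutive pieces of exactly `size` characters;
    # only the last piece may be shorter
    return [s[i:i + size] for i in range(0, len(s), size)]

Would something along these lines already beat textwrap for input like this, or is there a faster way still?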
Full code to reproduce the examples:
from random import choice
from string import ascii_uppercase
from collections import Counter
import pandas as pd
import textwrap
teststr = ''.join(choice(ascii_uppercase) for i in range(100000))
def wrap(s, target=100):
    # `s` is a list of strings; split every element that is still longer
    # than `target` roughly in half, then recurse on the resulting list
    parts = []
    for t in s:
        if len(t) > target:
            idx = len(t) // 2
            parts.extend([t[:idx], t[idx:]])
        elif len(t) <= target:
            # all elements have (almost) the same length, so if one is
            # short enough the whole list is done
            return s
    res = wrap(parts, target=target)
    return res
%timeit w1 = wrap([teststr], target=100)
%timeit w2 = textwrap.wrap(teststr, 100)
%timeit w3 = pd.Series(teststr).str.wrap(100).str.split("\n")
# sanity check: joining the chunks must restore the original string
w1 = wrap([teststr], target=100)
teststr == "".join(w1)
w2 = textwrap.wrap(teststr, 100)
teststr == "".join(w2)
w3 = pd.Series(teststr).str.wrap(100).str.split("\n").iloc[0]
teststr == "".join(w3)
w1c = Counter((len(x) for x in w1))
w2c = Counter((len(x) for x in w2))
w3c = Counter((len(x) for x in w3))
print("Recursive wrap")
print(w1c)
print("textwrap")
print(w2c)
print("pandas wrap")
print(w3c)