In a recent project I had to serialize a large scikit-learn model (a trained classifier) within a proprietary tool to persist it in a database. Due to restrictions in this tool, the only way was to get the model into a string (first pickling it, then base64-encoding the result).
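
For context, a minimal sketch of that serialization step (the model object here is just a placeholder, not the actual classifier):

import base64
import pickle

# Placeholder standing in for the trained scikit-learn classifier
model = {"weights": list(range(10))}

# Pickle the object and base64-encode the bytes so they fit into a string cell
model_str = base64.b64encode(pickle.dumps(model)).decode("ascii")

# The reverse direction when loading from the database
restored = pickle.loads(base64.b64decode(model_str))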

Another restriction was that the maximum string length per cell was around 10,000 characters, while the base64 string of the model was about 150,000,000 characters long.

My first approach using textwrap took multiple hours just for the wrapping, and even resorting to pandas did not help.

In the end I came up with a plain recursive Python function that repeatedly splits the string roughly in half, which is much faster than both library approaches mentioned before. I have the feeling that I'm comparing apples to oranges, but nonetheless I would be interested in some insight into this.

(full code below)

# plain python
%timeit w1 = wrap([teststr], target=100)
629 µs ± 85 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

# textwrap module
%timeit w2 = textwrap.wrap(teststr, 100)
24.3 ms ± 7.82 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

# pandas-based wrap
%timeit w3 = pd.Series(teststr).str.wrap(100).str.split("\n")
36.2 ms ± 1.72 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Now the main difference is that both library approaches split the string into equally sized chunks (except for the very last chunk), while the recursive function produces chunks of two different lengths, both at most the target length.

My question: can someone explain why the recursive approach with string slicing is so much faster than the other two? Is there some implicit looping needed to produce equally sized chunks that makes the library approaches so slow? Are the slicing operations (as used in the custom wrap) really that much faster?

Assuming I would need equally sized chunks (which I fortunately did not in this project), what method would give a better performance than both mentioned ones?

Full code to reproduce the examples:

from random import choice
from string import ascii_uppercase
from collections import Counter
import pandas as pd
import textwrap

teststr = ''.join(choice(ascii_uppercase) for i in range(100000))

def wrap(s, target=100):
    # s is a list of string chunks; split every chunk that is still too long
    # roughly in half and recurse until all chunks fit the target length.
    parts = []
    for t in s:
        if len(t) > target:
            idx = len(t) // 2
            parts.extend([t[:idx], t[idx:]])
        else:
            # All chunks on one recursion level differ in length by at most
            # one character, so as soon as one chunk fits, all of them do.
            return s
    return wrap(parts, target=target)


%timeit w1 = wrap([teststr], target=100)
%timeit w2 = textwrap.wrap(teststr, 100)
%timeit w3 = pd.Series(teststr).str.wrap(100).str.split("\n")

w1 = wrap([teststr], target=100)
print(teststr == "".join(w1))
w2 = textwrap.wrap(teststr, 100)
print(teststr == "".join(w2))
w3 = pd.Series(teststr).str.wrap(100).str.split("\n").iloc[0]
print(teststr == "".join(w3))

w1c = Counter(len(x) for x in w1)
w2c = Counter(len(x) for x in w2)
w3c = Counter(len(x) for x in w3)

print("Recursive wrap")
print(w1c)

print("textwrap")
print(w2c)

print("pandas wrap")
print(w3c)

1 Answer

First, according to the pandas docs, pd.Series.str.wrap calls textwrap under the hood, so we can treat those two as effectively the same benchmark.

The rest is speculative. Looking at the source of textwrap, there are several options enabled by default that trigger regex-based processing of the string, which is redundant in your case. For example:

expand_tabs=True
replace_whitespace=True
break_long_words=True
drop_whitespace=True
break_on_hyphens=True

You could try disabling any subset of those to see if performance improves. Overall, it seems that the textwrap module is designed with a different objective in mind from what you are trying to do.
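
As a rough sketch of what that could look like (whether it actually helps is something you would have to measure; note that break_long_words must stay enabled, otherwise a string without any whitespace would never be split):

import textwrap

# Disable the whitespace/tab/hyphen handling that a random base64-like
# string does not need; break_long_words keeps its default of True so the
# single long "word" still gets broken into chunks.
wrapper = textwrap.TextWrapper(
    width=100,
    expand_tabs=False,
    replace_whitespace=False,
    drop_whitespace=False,
    break_on_hyphens=False,
)
chunks = wrapper.wrap(teststr)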

Alternatively, you could try something along these lines.
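
One option is plain slicing into fixed-size pieces, which gives equally sized chunks and sidesteps textwrap entirely:

def chunk(s, size=100):
    # Every chunk except possibly the last has exactly `size` characters.
    return [s[i:i + size] for i in range(0, len(s), size)]

w4 = chunk(teststr, 100)
assert "".join(w4) == teststr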
