
I have a large string with a length on the order of 5*10^6 characters.

I have to do some processing by dividing it into blocks of 16 characters. I wrote a custom function to split the string, assuming its performance would be better than the slicing approach.

The functions are as follows:

def spliceSplitter(s):
    sum = 0
    while len(s) > 0:
        block = s[:16]
        # Assuming the process to be done with the data block is calculating its length.
        sum += len(block)
        s = s[16:]
    return sum

And the custom function:

def normalSplitter(s):
    sum = 0
    l = len(s)
    data = ""
    for i in xrange(l):
        if i % 16 == 0:
            # Assuming the process to be done with the data block is calculating its length.
            sum += len(data)
            data = ""
        data += s[i]
    return sum + len(data)

I profiled both of them with cProfile; the results are as follows (time in seconds):

String Length     |  Splice Splitter   |  Normal Splitter
---------------------------------------------------------
5000000           |  289.0             |  1.274 
500000            |  0.592             |  0.134
50000             |  0.25              |  0.28
5000              |  0.001             |  0.003 

I am generating the string as follows:

s = ''.join([str(random.randint(1,9)) for x in xrange(5000000)])

My questions:

  • Is there a Pythonic way to get the same or better efficiency than the custom normal splitter? For example, splitting the entire string beforehand, storing it in a list, and then operating on it iteratively (see the sketch after this list).
  • Why is the performance of the splice splitter better for smaller strings? (Just curious about this one.)
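
A minimal sketch of the pre-split idea from the first bullet, matching the Python 2 code above; the name listSplitter and the len() stand-in for the real processing are mine, not part of the question:

    def listSplitter(s, n=16):
        # Pre-split the whole string into a list of n-character blocks.
        # Each slice copies at most n characters, so total work is O(len(s)).
        blocks = [s[i:i+n] for i in xrange(0, len(s), n)]
        total = 0
        for block in blocks:
            total += len(block)  # stand-in for process(block)
        return total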

NOTE: The process(data) that I have to perform has no return value.

EDIT

Using a generator (yield) and an improved splice splitter, the performance was as follows:

String Length     |  Splice Splitter   |  Normal Splitter  |  Yield/Generator
-------------------------------------------------------------------------------
5000000           |  0.148             |  1.274            |  0.223
500000            |  0.016             |  0.134            |  0.29
50000             |  0.003             |  0.28             |  0.005
5000              |  ~0.000            |  0.003            |  ~0.000

Code:

def pythonicSplitter(s):
    gen = (s[i:i+16] for i in xrange(0, len(s), 16))
    sum = 0
    for data in gen:
        sum += len(data)
    return sum

def spliceSplitter(s):
    sum = 0
    for x in xrange(0, len(s), 16):
        block = s[x:x+16]
        # Assuming the process to be done with the data block is calculating its length.
        sum += len(block)
    return sum

Reasons for the improved performance:

  • The original splice splitter was repeatedly creating a new string on each iteration with the slice s = s[16:], as pointed out in Patrick Collins's answer. Copying the whole remaining string every 16 characters leads to a time complexity of ~O(N^2).
  • Once the repeated re-creation of s was replaced with s[x:x+16], each iteration copies at most 16 characters, so the total work drops to O(N), improving performance by a huge margin. The generator version (pythonicSplitter()) does the same thing, but the large number of calls into the generator makes it slightly slower than the improved splice splitter (a timing sketch follows this list).
  • The normal splitter also builds blocks of length 16, but because Python strings are immutable, data += s[i] creates a new string for every character, and the per-character loop runs in interpreted bytecode. That makes it significantly slower than the built-in, C-implemented slicing.
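
A minimal timing sketch, assuming the three functions above are defined in the same module; timeit avoids cProfile's per-call overhead, which inflates the generator's numbers:

    import random
    import timeit

    s = ''.join([str(random.randint(1, 9)) for x in xrange(500000)])

    # Average of 10 runs per function.
    for fn in (spliceSplitter, pythonicSplitter, normalSplitter):
        t = timeit.timeit(lambda: fn(s), number=10)
        print fn.__name__, t / 10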
– Kyuubi
  • This question appears to be off-topic because it belongs on http://codereview.stackexchange.com – jonrsharpe Jun 16 '14 at 15:01
  • [As members of a community, your first loyalty should be to that community. When evaluating a question, you shouldn’t be looking to push it off on some other site; instead, ask if it could be appropriate and on-topic for you, the experts who the author decided to ask. **Be a bit jealous of your site**](http://blog.stackoverflow.com/2012/03/respect-the-community-your-own-and-others/) - Shog9. Just because something's on-topic on CR doesn't mean it's off-topic on SO @jonrsharpe; and anything to stop SO becoming "debug this for me". – Ben Jun 16 '14 at 15:02
  • @Ben debug what for whom?! If you disagree with my opinion, that's fine, don't vote for it; the lecture-by-proxy isn't helpful. *"Just because something's on-topic on CR doesn't mean it's off-topic on SO"* - true, but in this case my opinion was and is that this belongs on the former. – jonrsharpe Jun 16 '14 at 15:16

1 Answer


I would guess that this line: s = s[16:] is causing s to be overwritten on every loop iteration, copying the whole remaining string. block = s[:16] also copies a string, so you're essentially writing the string to memory twice on every loop iteration. In normalSplitter(), by contrast, data = "" ensures that you never keep more than 16 characters of your copy of the string in memory at one time, and you never do any copy operations on the whole string.

That's pushing a lot of data around and, I expect, causing you to start getting cache misses on the largest string (evidently the smaller strings fit inside the cache comfortably). Try using a solution like this one:

def newSplitter(s, n=16):
    # One n-character slice per block; the whole string is never copied.
    for i in xrange(0, len(s), n):
        yield s[i:i+n]
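
For example, a minimal usage sketch (len here is just a stand-in for whatever processing you actually do with each block):

    total = 0
    for block in newSplitter(s):
        total += len(block)  # stand-in for process(block)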
– Patrick Collins