I have a large string whose length is on the order of 5*10^6 characters.
I have to do some processing on it by dividing it into blocks of 16 characters. I used a custom-made function to split the string, assuming its performance would be better than the slicing approach (the "Splice Splitter" below).
The functions are as follows:
    def spliceSplitter(s):
        sum = 0
        while len(s) > 0:
            block = s[:16]
            # Assuming the process to be done with the data block is calculating its length.
            sum += len(block)
            s = s[16:]
        return sum
And the custom function:
    def normalSplitter(s):
        sum = 0
        l = len(s)
        data = ""
        for i in xrange(l):
            if i % 16 == 0:
                # Assuming the process to be done with the data block is calculating its length.
                sum += len(data)
                data = ""
            data += s[i]
        return sum + len(data)
I ran cProfile on both of them, and the results are as follows (time in seconds):
String Length | Splice Splitter | Normal Splitter
---------------------------------------------------------
5000000 | 289.0 | 1.274
500000 | 0.592 | 0.134
50000 | 0.25 | 0.28
5000 | 0.001 | 0.003
I am generating the string as follows:

    s = ''.join([str(random.randint(1, 9)) for x in xrange(5000000)])
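A minimal `timeit`-based harness along these lines reproduces the comparison (a sketch only, not my exact cProfile setup; it assumes both splitter functions above are defined in the same module):

    import timeit

    # Time a single call to each splitter on the generated string s.
    for fn in (spliceSplitter, normalSplitter):
        t = timeit.timeit(lambda: fn(s), number=1)
        print '%-16s %.3f s' % (fn.__name__, t)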
My Questions:
- Is there a Pythonic way to get the same or better efficiency than the custom Normal Splitter? Perhaps by splitting the entire string beforehand, storing it in a list, and then operating on it iteratively (see the sketch after this list).
- Why does the Splice Splitter perform better on smaller strings? (Just curious about this one.)
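For example, the pre-splitting idea I have in mind looks roughly like this (an untested sketch; `process()` is the placeholder described in the note below):

    # Untested sketch: pre-split the string into 16-character blocks,
    # then hand each block to process() (which returns nothing).
    blocks = [s[i:i+16] for i in xrange(0, len(s), 16)]
    for data in blocks:
        process(data)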
NOTE: The `process(data)` step that I actually have to perform has no return value.
EDIT
Using yield (a generator) and an improved Splice Splitter, the performance came out as follows:
String Length | Splice Splitter | Normal Splitter | Yield/Generator
-------------------------------------------------------------------------------
5000000 | 0.148 | 1.274 | 0.223
500000 | 0.016 | 0.134 | 0.29
50000 | 0.003 | 0.28 | 0.005
5000 | ~0.000 | 0.003 | ~0.000
Code:
    def pythonicSplitter(s):
        gen = (s[i:i+16] for i in xrange(0, len(s), 16))
        sum = 0
        for data in gen:
            sum += len(data)
        return sum

    def spliceSplitter(s):
        sum = 0
        for x in xrange(0, len(s), 16):
            block = s[x:x+16]
            # Assuming the process to be done with the data block is calculating its length.
            sum += len(block)
        return sum
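Since `process(data)` has no return value, the generator version can also feed blocks to it lazily without ever building a list (again just a sketch; `processBlocks` is an illustrative name):

    def processBlocks(s):
        # Lazily feed each 16-character block straight into process();
        # nothing is accumulated or returned.
        for data in (s[i:i+16] for i in xrange(0, len(s), 16)):
            process(data)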
Reasons for the improved performance:
- The Splice Splitter was repeatedly creating a new string on every iteration via `s = s[16:]`, as pointed out in Patrick Collin's answer. Each iteration copies the whole remaining tail, so the total characters copied are roughly N + (N-16) + (N-32) + ..., i.e. a time complexity of ~O(N^2).
- Once that repeated re-creation of `s` was replaced with `s[x:x+16]`, the time complexity dropped to O(N*16) = O(N), improving performance by a huge margin. The yield/generator function (`pythonicSplitter()`) does the same thing, but due to the large number of calls into the generator, it takes slightly longer than the improved Splice Splitter.
- The Normal Splitter is also building blocks of length 16, but since strings in Python are immutable, building each block one character at a time with `data += s[i]` is significantly more expensive than the built-in, optimized slicing.
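To see the ~O(N^2) behaviour in isolation, here is a minimal micro-benchmark sketch (the function names are made up for illustration):

    import timeit

    def chopping(s):
        # Copies the entire remaining tail on every iteration:
        # ~N/16 copies averaging ~N/2 characters, i.e. ~O(N^2) work.
        while s:
            s = s[16:]

    def stepping(s):
        # Copies each 16-character block exactly once: O(N) work.
        for x in xrange(0, len(s), 16):
            s[x:x+16]

    s = '1' * 200000
    print 'chopping:', timeit.timeit(lambda: chopping(s), number=1)
    print 'stepping:', timeit.timeit(lambda: stepping(s), number=1)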