First off: I find optimizing string conversion routines in signal processing chains most of the time to be totally in vain. The speed at which your system loads data in string form (which will probably happen from some mass storage, where it was put by something that didn't care about performance, since it wouldn't have chosen a string format in the first place, otherwise), and if you compare read speeds of all but clusters of SSDs attached via PCIe with how fast atoll
is, you'll notice that you're losing a negligible amount of time on inefficient conversion. If you pipeline loading parts of that string with conversion, the time spent waiting for storage will not even be remotely filled up with converting, so even without any algorithmic optimization, pipelining/multi-threading will eliminate practically all time spent on conversion.
I'm going to go ahead and assume your integer-containing string is sufficiently large. Like, tens of millions of integers. Otherwise, all optimization might be pretty premature, considering there's little to complain about std::iostream
performance.
Now, the trick is that no performance optimization can be done once the performance of your conversion routine hits the memory bandwidth barrier. To push that barrier as far as possible, it's crucial to optimize usage of CPU caches – hence, doing linear access and shuffling memory as little as possible is crucial here. Also, if you care for speed, you don't want to call a function every time you need to convert a few-digit number – the call overhead (saving/restoring stack, jumping back and forth) will be significant. So if you're after performance, you'll do the conversion of the whole string at once, and then just access the resulting integer array.
So you'd have roughly something like, on a modern, SSE4.2 capable x86 processor
Outer loop, jumps in steps of 16:
- load 128 bit of input string into 128 bit SIMD register
- run something like
__mm_cmpestri
to find indices of
delimiters and \0
terminator in all these 16 bytes at once
- inner loop over the found indices
- Use SSE copy/shift/immediate instructions to isolate substrings; fill the others with
0
- prepend saved "last characters" from previous iteration (if any – should only be the case for first inner loop iteration per outer loop iteration)
- subtract
0
from each of the digits, again using SSE instructions to do up to 16 subtractions with a single instruction (_mm_sub_epi8
)
- convert the eight 16bit subwords to eight 128 bit words containing two packed 64bit integers each (one instruction per 16bit,
_mm_cvtepi8_epi64
, I think)
- initialize a
__mm128
register with [10^15 10^14]
, let's call it powers
- loop over pairs dual-64bit words: (each step should be one SSE instruction)
- multiply first with
powers
- divide powers by
[100 100]
- multiply second with
powers
- add results to dual-64bit accumulator
- sum the two values in accumulator
store
the result to integer array