How to optimize strtok + atoll

Question

Is there any good way to optimize this function in terms of execution time? My final goal is to parse a long string composed of several integers (thousands of integer per line, and thousands of lines). This was my initial solution.

int64_t get_next_int(char *newLine) {
    char *token=strtok(newLine, " ");
    if( token == NULL ) {
        exit(0);
    }
    return atoll(token);
}

More details: I need the "state" based implementation of strtok, so the padding implemented by strtok should exist in the final string. Atoll does not need of any kind of verification.

Target system: Intel x86_64 (Xeon series)

Related topics:

atoi optimization: C++ most efficient way to convert string to int (faster than atoi)

*Optimize* by what metric? Beauty? Memory bandwidth? Memory usage? Speed on what machine? Unclear. — Marcus Müller, Jun 28 '16 at 09:00
it seems at first parsing(before space character) it will returun 0 if it is non-number entry or combined alphabetic and number in such a way that alphabetic at beginning . If combined and number at beginning, it will return the number merely. So, it is not safe code for me. Without safety, It shouldn't be optimized. — Soner from The Ottoman Empire, Jun 28 '16 at 09:21
You might want to specify in words what you are trying to achieve. Is the side-effect of `strtok` (stuffing a `\0` byte somewhere inside the input string) desirable? Is the behavior of `atoll` in case of errors desirable? Do you want any specific behavior in case of errors at all? Also, your question has a link to a very similar question with answers. You should specify how your question is different from the other one, and why the answers to the other question are not sufficient. — anatolyg, Jun 28 '16 at 09:35
I think your code is little gobbledygook. Show what you exactly want without simplification. — Soner from The Ottoman Empire, Jun 28 '16 at 09:50

score 3 · Accepted Answer · edited May 23 '17 at 12:03

First off: I find optimizing string conversion routines in signal processing chains most of the time to be totally in vain. The speed at which your system loads data in string form (which will probably happen from some mass storage, where it was put by something that didn't care about performance, since it wouldn't have chosen a string format in the first place, otherwise), and if you compare read speeds of all but clusters of SSDs attached via PCIe with how fast atoll is, you'll notice that you're losing a negligible amount of time on inefficient conversion. If you pipeline loading parts of that string with conversion, the time spent waiting for storage will not even be remotely filled up with converting, so even without any algorithmic optimization, pipelining/multi-threading will eliminate practically all time spent on conversion.

I'm going to go ahead and assume your integer-containing string is sufficiently large. Like, tens of millions of integers. Otherwise, all optimization might be pretty premature, considering there's little to complain about std::iostream performance.

Now, the trick is that no performance optimization can be done once the performance of your conversion routine hits the memory bandwidth barrier. To push that barrier as far as possible, it's crucial to optimize usage of CPU caches – hence, doing linear access and shuffling memory as little as possible is crucial here. Also, if you care for speed, you don't want to call a function every time you need to convert a few-digit number – the call overhead (saving/restoring stack, jumping back and forth) will be significant. So if you're after performance, you'll do the conversion of the whole string at once, and then just access the resulting integer array.

So you'd have roughly something like, on a modern, SSE4.2 capable x86 processor

Outer loop, jumps in steps of 16:

load 128 bit of input string into 128 bit SIMD register
run something like __mm_cmpestri to find indices of delimiters and \0 terminator in all these 16 bytes at once
inner loop over the found indices
- Use SSE copy/shift/immediate instructions to isolate substrings; fill the others with 0
- prepend saved "last characters" from previous iteration (if any – should only be the case for first inner loop iteration per outer loop iteration)
- subtract 0 from each of the digits, again using SSE instructions to do up to 16 subtractions with a single instruction (_mm_sub_epi8)
- convert the eight 16bit subwords to eight 128 bit words containing two packed 64bit integers each (one instruction per 16bit, _mm_cvtepi8_epi64, I think)
- initialize a __mm128 register with [10^15 10^14], let's call it powers
- loop over pairs dual-64bit words: (each step should be one SSE instruction)
  - multiply first with powers
  - divide powers by [100 100]
  - multiply second with powers
  - add results to dual-64bit accumulator
- sum the two values in accumulator
- store the result to integer array

That was an interesting point. In fact, I already use some pipelining with asynchronous IO read. Also, avoiding function call was another good starting point to optimize. One additional feature I use is that while there are no data available (system is still loading strings) I perform some other stuff required in my software - and I've found my strtok+stoi being much slower compared to other parts of the system. — rph, Jun 29 '16 at 09:18

score 2 · Answer 2 · edited May 23 '17 at 10:29

2

I'd rather use something along the lines of a std::istringstream:

int64_t get_next_int(std::istringstream& line) {
    int64_t token;
    if(!(line >> token))
         exit(0);
    return token;
}


 std::istringstream line(newLine);

 int64_t i = get_next_int(line);

strtok() has well known drawbacks, and you don't want to use it at all.

edited May 23 '17 at 10:29

Community

1
1

answered Jun 28 '16 at 08:57

πάντα ῥεῖ

1
13
116
190

What about the issues regarding converting a cstring to isstringstream type? Does this include considerable overhead? – rph Jun 28 '16 at 09:05
1

@rkioji _"Does this include considerable overhead?"_ Usually not. – πάντα ῥεῖ Jun 28 '16 at 09:07
How does this compare to sscanf? – rph Jun 28 '16 at 09:13
1

@rkioji At least in better type safety. You might want to read [this](http://stackoverflow.com/questions/1042110/using-scanf-in-c-programs-is-faster-than-using-cin) before asking again. – πάντα ῥεῖ Jun 28 '16 at 09:16
@πάνταῥεῖ great link; referred to that answer pretty often outside SO, and now, [finally in an answer on SO](http://stackoverflow.com/a/38076565/4433386). – Marcus Müller Jun 28 '16 at 12:42

score 1 · Answer 3 · 2016-06-28T09:37:00.273

1

What about

int n= 0;

// Find the token
for ( ; *newline == ' '; newline++)
  ;
if (*newline == 0)
  // Not found
  exit(0);

// Scan and convert the token
for ( ; unsigned(*newline - '0') < 10; newline++)
  n= 10 * n + *newline - '0';
return n;

edited Jun 28 '16 at 09:37

answered Jun 28 '16 at 09:17

score 1 · Answer 4 · answered Jun 28 '16 at 09:47

AFA I get from your code at first splitting it will return. It seems at first parsing(before space character) it will returun 0 if it is non-number entry or combined alphabetic and number in such a way that alphabetic at beginning . If combined and number at beginning, it will return the number merely. Namely, you just need a string for the conversion. So you don't need tokenizing just check the string is null or not. You can change return type as well. Because, if you need a type with _exactly_ 64 bits, use (u)int64_t, if you need _at least_ 64 bits, (unsigned) long long is perfectly fine, as would be (u)int_least64_t. I think your code is little gobbledygook. Show what you exactly want without simplification.

/*
 * ascii-to-longlong conversion
 *
 * no error checking; assumes decimal digits
 *
 * efficient conversion: 
 *   start with value = 0
 *   then, starting at first character, repeat the following
 *   until the end of the string:
 *
 *     new value = (10 * (old value)) + decimal value of next character
 *
 */
long long my_atoll(char *instr)
{
  if(str[0] == '\0')
    return -1;

  long long retval;
  int i;

  retval = 0;
  for (; *instr; instr++) {
    retval = 10*retval + (*instr - '0');
  }
  return retval;
}

How to optimize strtok + atoll

4 Answers4