0

I'm writing latency-critical application (homemade HFT trading system). I have such code which just convert uint64 to string:

    // TODO: cache sprintf, use strcpy? measure?
    sprintf(dest, "%" PRIu64, divRes.quot);

Here divRes.quot is integer number which guaranteed to be between 1 and 1 000 000. So I can preallocate (pretty big) array and "cache" every single value. Then I can just execute strcpy(dest, cache[divRes.quot]).

At first glance it must be significantly faster, because strcpy must be significantly faster, than sprintf. However note that I'm using huge array which almost surely can not be fully loaded to CPU cache. So second approach almost surely will go to main memory. While in first approach i'm pretty likely will stay in CPU cache (probably even in fastest L1 cache?!)

So in average what would be faster:

  • slow function in CPU cache
  • fast function with access to main memory?

I think it depends on how much faster one function than another and how much faster CPU cache access than main memory access.

I guess it's very hard to write a true test. Because in real application overall system load will be different and so cache/memory usage will be different and this can change things dramatically.

Please note I don't care about readability, maintance etc, I only need speed.

Oleg Vazhnev
  • 23,239
  • 54
  • 171
  • 305
  • 2
    Did you already measure it? That's actually the only way to find out for your particular OS/CPU. – πάντα ῥεῖ Feb 22 '15 at 21:49
  • not yet. i guess in simple test strcpy will be faster, however it doesn't mean that it will be faster in real application. – Oleg Vazhnev Feb 22 '15 at 21:50
  • You're correct, the cache behaviour in a complex system may be difficult to emulate with a synthetic test. None of us has access to your overall system, so I'm not sure what advice you're looking for here? – Oliver Charlesworth Feb 22 '15 at 21:51
  • `strcpy()` will of course be faster, it's doing much less than `sprintf()` actually. Think about parsing the format string alone. – πάντα ῥεῖ Feb 22 '15 at 21:52
  • I think you could write your own converter that would be faster than `sprintf()` given that `sprintf()` is very general purpose and your needs are very specific. – Galik Feb 22 '15 at 21:53
  • @πάνταῥεῖ: Once you take the horrific access pattern of the `strcpy` approach into account, it may well not be faster. – Oliver Charlesworth Feb 22 '15 at 21:54
  • 1
    Second question: Is this actually a bottleneck in your system? – Oliver Charlesworth Feb 22 '15 at 21:54
  • @OliverCharlesworth yes currently any "string operations" - bottleneck, because it's slow. I try to use integers only, double rarely, strings only when absolutely required – Oleg Vazhnev Feb 22 '15 at 21:56
  • 1
    @javapowered: Sure, but is it actually a significant proportion of your overall runtime? (Say more than 20%) i.e. is this premature optimisation? – Oliver Charlesworth Feb 22 '15 at 21:57
  • 1
    @OliverCharlesworth in hft trading "propotion" doesn't matter. it's like sport. if one guy finished in 2.00.001 and another in 2.00.002 - the first one is the winner and the second one is noone ) and I think it takes "some", pretty non-zero proportion. I.e. it makes sense to try to optimize this part. – Oleg Vazhnev Feb 22 '15 at 22:00
  • It does still matter. If it's only a few %, then there are likely other areas to be optimised that will give you a much bigger return on your effort. "I think it takes" implies you haven't actually measured it, so you've fallen straight into the "premature optimisation" trap ;) – Oliver Charlesworth Feb 22 '15 at 22:01
  • 1
    @OliverCharlesworth you have to optimize all such places. again and again and again ) Currently this is the one of the potentially slowest places I can find. – Oleg Vazhnev Feb 22 '15 at 22:02
  • 1
    This question has some interesting answers that may help: https://stackoverflow.com/questions/4351371/c-performance-challenge-integer-to-stdstring-conversion – Galik Feb 22 '15 at 22:03
  • 1
    @javapowered do you not have a test system you can do the performance test on? – user253751 Feb 22 '15 at 22:06
  • 1
    (If you're HFT trading *without* having a separate system to test on, prepare to be the next Knight Capital...) – user253751 Feb 22 '15 at 22:06
  • 1
    If you need to write really fast function to convert integers to string (which is actually simple), do it yourself, without using `strcpy/snprintf` than test results. I.e. you may want to unroll loop, and create two branches of loop so you will achieve registry independency. But all of that depends on CPU. – myaut Feb 22 '15 at 22:36
  • for what it's worth, since you know the value is between 0 and 1000000, use ints rather than 64 bit longs: `sprintf(dest, "%u", (unsigned)divRes.quot);` – chqrlie Feb 22 '15 at 22:49

1 Answers1

1

For the table lookup to work out well, you'd have to be doing this often enough (on a CPU with a large cache) for the majority of the table to be in cache most of the time you're doing it. The table occupies around 7 megabytes of memory, so unless the cache was quite large, and you were converting millions of numbers at a time so most of the accesses were to cache, it would almost certainly be a net loss.

By my figuring, it's likely to take around 100 clocks to convert a single number using normal division (~5 divisions + 6 additions). A read from main memory typically takes something like 200 processor clocks, so you'd need a cache hit rate around 50% just to break even.

Personally, I doubt I'd use either of those approaches. Instead, I'd probably do a hybrid. I'd divide the number by 1000, then do two table lookups (one with the dividend, the other with the remainder).

The advantage is that this reduces the table size to around 4 kilobytes, and increases the use of each table entry by a factor of about 1000. Assuming you're converting at least a few hundred (or so) randomly distributed numbers at a time, you can probably count on nearly 100% cache hit rate. With a high cache hit rate, we can plan on one division plus two loads from cache for a total of around 25 clocks, or roughly 4 times as fast as we can expect from a naive conversion.

Jerry Coffin
  • 476,176
  • 80
  • 629
  • 1,111
  • interesting suggestion, do you know how many clocks it takes to strcpy 4-char string? – Oleg Vazhnev Feb 23 '15 at 06:35
  • Since you'd know up front that you were copying 4 bytes, you'd want to read/write a 4-byte item (e.g., a `long`) rather than use `strcpy`. With a typical CPU, that would be one read (usually around 3 clocks for a cache hit) and one write (usually queued, so ~1 clock). – Jerry Coffin Feb 23 '15 at 07:00
  • I don't know the length exactly. I want to use 1000 items cache [0..999], so, with null terminating character the length will be between 2 and 4 – Oleg Vazhnev Feb 23 '15 at 07:02
  • @javapowered: If at all possible, make them all 4 characters long (e.g., pad with zeros). – Jerry Coffin Feb 23 '15 at 07:14