
This function reads an array of doubles from a string:

#include <sstream>
#include <string>
#include <vector>
using namespace std;

vector<double> parseVals(string& str) {
    stringstream ss(str);
    vector<double> vals;
    double val;
    while (ss >> val) vals.push_back(val);
    return vals;
}

When called with a string containing 1 million numbers, the function takes 7.8 seconds to execute (Core i5, 3.3GHz). This means that 25000 CPU cycles are spent to parse ONE NUMBER.
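The exact timing harness isn't shown above; a minimal way to reproduce this kind of measurement (a sketch only, assuming `std::chrono`; `timeParse` is a name invented here, not part of the original code) looks like this:

```cpp
#include <chrono>
#include <sstream>
#include <string>
#include <vector>

// The function from the question, fully qualified.
std::vector<double> parseVals(std::string& str) {
    std::stringstream ss(str);
    std::vector<double> vals;
    double val;
    while (ss >> val) vals.push_back(val);
    return vals;
}

// Build a string of n whitespace-separated doubles, parse it once,
// and return the elapsed wall-clock time in seconds (-1 on a miscount).
double timeParse(int n) {
    std::ostringstream oss;
    for (int i = 0; i < n; ++i) oss << i * 0.5 << ' ';
    std::string input = oss.str();

    auto start = std::chrono::steady_clock::now();
    std::vector<double> vals = parseVals(input);
    auto stop = std::chrono::steady_clock::now();
    return vals.size() == static_cast<std::size_t>(n)
        ? std::chrono::duration<double>(stop - start).count()
        : -1.0;
}
```

With n = 1000000 and a 3.3 GHz clock, seconds × 3.3e9 / 1e6 gives the quoted cycles-per-number figure.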

user315052 has pointed out that the same code runs an order of magnitude faster on his system, and further testing has shown very large performance differences among different systems and compilers (also see user315052's answer):

1. Win7, Visual Studio 2012RC or Intel C++ 2013 beta: 7.8  sec
2. Win7, mingw / g++ 4.5.2                          : 4    sec
3. Win7, Visual Studio 2010                         : 0.94 sec
4. Ubuntu 12.04, g++ 4.7                            : 0.65 sec

I have found a great alternative in the Boost/Spirit library. The code is safe, concise and extremely fast (0.06 seconds on VC2012, 130x faster than stringstream).

#include <boost/spirit/include/qi.hpp>
#include <string>
#include <vector>
using namespace std;

namespace qi = boost::spirit::qi;
namespace ascii = boost::spirit::ascii;

vector<double> parseVals4(string& str) {
    vector<double> vals;
    qi::phrase_parse(str.begin(), str.end(),
        *qi::double_ >> qi::eoi, ascii::space, vals);
    return vals;
}

Although this solves the problem from a practical standpoint, I would still like to know why the performance of stringstream is so inconsistent. I profiled the program to identify the bottleneck, but the STL code looks like gibberish to me. Comments from anybody familiar with STL internals would be much appreciated.

PS: Optimization is O2 or better in all of the above timings. Neither the instantiation of the stringstream nor the reallocation of the vector figures in the program profile. Virtually all of the time is spent inside the extraction operator.

untraceable
    I know this isn't an answer to your question, but if performance of such a routine is important to you, you probably want to preallocate your vector using `reserve()`. – Turix Jul 12 '12 at 05:55
    *"When called with a string containing 1 million numbers, the function takes 7.8 seconds to execute (Core i5, 3.3GHz). This means that 25000 CPU cycles are spent to parse ONE NUMBER."* Sorry, but you can't calculate cycles that way. I do agree that string parsing is expensive; which is why you should generally store large numbers of numbers in a binary representation. – smocking Jul 12 '12 at 06:01
  • @Turix: preallocation is definitely not the problem here. – untraceable Jul 12 '12 at 06:10
  • @smocking: 1) In my experience you can calculate cycles this way. Unless you access memory all over the place, the CPU executes about one instruction per cycle (pipelining etc.). – untraceable Jul 12 '12 at 06:16
    @untraceable, there is a world of difference between [FLOPS](http://en.wikipedia.org/wiki/FLOPS) and Hz; it's like comparing RPM to velocity. Besides, you already mentioned memory access, which is a major bottleneck that doesn't necessarily correlate with the CPU speed. What makes you so sure you're not accessing *"memory all over the place"*? – smocking Jul 12 '12 at 06:27
  • @untraceable: what makes you think that preallocation is not an issue? it's likely that with 1M numbers and a growth rate of 1.5 for your vector you'll have about 30 reallocs during that inner loop, which require allocating, deallocating and copying all that data. That does not sound insignificant to me. Though it will not explain the full 7s, it is important. – KillianDS Jul 12 '12 at 06:32
  • @smocking: "What makes you so sure you're not accessing 'memory all over the place'?" - The textual representation of a floating point number fits into 20 bytes or so, less than a cache line. In my understanding, parsing numbers is an extremely memory-local affair. – untraceable Jul 12 '12 at 06:33
  • @untraceable "The textual representation of a floating point number fits into 20 bytes or so, less than a cache line." But a million of these in a string could be scattered over your memory, not all of which will be cached, nor indeed paged in, at once. There's no guarantee a compiler will "pack" your string into consecutive memory. – Turix Jul 12 '12 at 06:39
  • @KillianDS: As already mentioned, I profiled the program and the reallocation takes at most 0.1% of the runtime. – untraceable Jul 12 '12 at 06:39
  • @Turix: actually, since the latest C++ standard, strings are contiguous (and I think they were almost always implemented like that anyway). But access to the `std::vector` could always invalidate cache lines. – KillianDS Jul 12 '12 at 06:40
  • @untraceable: OK this isn't going anywhere. If you're really stuck with parsing doubles and want to speed it up, you can take a look at the assembly library [here](http://www.ray.masmcode.com/fpu.html) (specifically `FpuAtoFL.asm`) and/or write your own based on it. – smocking Jul 12 '12 at 06:40
  • @KillianDS: Thanks, I didn't know that. (Although even with a consecutive string, it probably won't all be cached or paged at once.) – Turix Jul 12 '12 at 06:42
    @untraceable: just to make sure... this is **optimized** code, right ? – Matthieu M. Jul 12 '12 at 07:24
  • Despite calling it "STL", the question in its current form "Why is iostream slow?" has many duplicates. At the time of this writing, I'm not sure there is much ground being gained here. – Drew Dormann Jul 12 '12 at 12:56
  • possible duplicate of [Why is reading lines from stdin much slower in C++ than Python?](http://stackoverflow.com/questions/9371238/why-is-reading-lines-from-stdin-much-slower-in-c-than-python) – Drew Dormann Jul 12 '12 at 12:57
  • @DrewDormann: While I admit there is some code reuse, I don't think I could ever liken `stringstream` with an `iostream`. – jxh Jul 12 '12 at 16:43
    @user315052: Can you explain what you mean? `stringstream` isn't just **like** an `iostream`. It **is** an `iostream`. – Drew Dormann Jul 12 '12 at 16:48
  • @DrewDormann: Thank you for the link. Unfortunately the suggestion to turn off the synchronization with stdio did not have any effect. This would have surprised me in any case, as stringstream is entirely in memory and should not be affected by buffering. – untraceable Jul 12 '12 at 16:55
  • @DrewDormann: As untraceable said, a `stringstream` does not read from a device. It has the interface of `iostream`, but not the file descriptor backend of `std::cin`/`std::cout`. – jxh Jul 12 '12 at 17:03
  • @user315052: I see what you're saying. Yes, there were SO posts that were closer to this question, but they too were closed as duplicates. Example http://stackoverflow.com/questions/5830868 – Drew Dormann Jul 12 '12 at 17:15
  • @DrewDormann: I must admit, I think SO is over aggressive on duplicates. The answer that is suggested as the duplicate isn't complete for the question. The file should be mapped into memory first. – jxh Jul 12 '12 at 17:31
  • @DrewDormann: Anyway, the question really is, why is untraceable's system so slow? – jxh Jul 12 '12 at 17:35
  • Boost Spirit's `double_` parser crashes with `array out of bounds` for any large value like `"1e400"` – Inverse Nov 26 '13 at 01:00

5 Answers


On my Linux VM running on a 1.6 GHz i7, it takes less than half a second. My conclusion is that the parsing is not as slow as you are observing it to be. There must be some other artifact that you are measuring to cause your observation to be so vastly different from mine. So that we can be more sure we are comparing apples to apples, I'll provide what I did.

Edit: On my Linux system, I have g++ 4.6.3, compiled with -O3. Since I don't have the MS or Intel compilers, I used cygwin g++ 4.5.3, also compiled with -O3. Another relevant fact: my Windows 7 is 64 bit, as is my Linux VM, while I believe cygwin only runs in 32 bit mode. On Linux, I got the following output:

elapsed: 0.46 stringstream
elapsed: 0.11 strtod

On cygwin, I got the following:

elapsed: 1.685 stringstream
elapsed: 0.171 strtod

I speculate that the difference between cygwin and Linux performance has something to do with MS library dependencies. Note that the cygwin environment is just on the host machine of the Linux VM.

This is the routine I timed that used istringstream.

#include <sstream>
#include <string>
#include <vector>

std::vector<double> parseVals (std::string &s) {
    std::istringstream ss(s);
    std::vector<double> vals;
    vals.reserve(1000000);
    double val;
    while (ss >> val) vals.push_back(val);
    return vals;
}

This is the routine I timed that used strtod.

#include <cstdlib>
#include <vector>

std::vector<double> parseVals2 (char *s) {
    char *p = 0;
    std::vector<double> vals;
    vals.reserve(1000000);
    do {
        double val = strtod(s, &p);
        if (s == p) break;
        vals.push_back(val);
        s = p+1;
    } while (*p);
    return vals;
}

This is the routine I used to populate the string with one million doubles.

#include <cstdlib>
#include <sstream>
#include <string>

std::string one_million_doubles () {
    std::ostringstream oss;
    double x = RAND_MAX/(1.0 + rand()) + rand();
    oss << x;
    for (int i = 1; i < 1000000; ++i) {
        x = RAND_MAX/(1.0 + rand()) + rand();
        oss << " " << x;
    }
    return oss.str();
}

This is the routine I used to do the timing:

#include <cassert>
#include <iostream>
#include <sys/times.h>
#include <unistd.h>
#include <vector>

template <typename PARSE, typename S>
void time_parse (PARSE p, S s, const char *m) {
    struct tms start;
    struct tms finish;
    long ticks_per_second;
    std::vector<double> vals_vec;

    times(&start);
    vals_vec = p(s);
    times(&finish);
    assert(vals_vec.size() == 1000000);
    ticks_per_second = sysconf(_SC_CLK_TCK);
    std::cout << "elapsed: "
              << ((finish.tms_utime - start.tms_utime
                   + finish.tms_stime - start.tms_stime)
                  / (1.0 * ticks_per_second))
              << " " << m << std::endl;
}

And, this was the main function:

int main ()
{
    std::string vals_str;

    vals_str = one_million_doubles();
    std::vector<char> s(vals_str.begin(), vals_str.end());
    s.push_back('\0');  // strtod needs a NUL-terminated buffer

    time_parse(parseVals, vals_str, "stringstream");
    time_parse(parseVals2, &s[0], "strtod");
}
jxh
  • This is odd. I do exactly the same, but it takes 25 times longer (GHz adjusted). I compiled my code in release mode with full optimizations using the newest versions of Microsoft and Intel compilers. Both produce the same result. Can this be a Windows issue? Some kind of expensive locale checks? – untraceable Jul 12 '12 at 07:03
  • @untraceable: I don't have MS or Intel compilers. G++ 4.5 on cygwin gives me results about 4 times worse than the Linux times (1.7 seconds). A `strtod` version is about 10 times faster than that on cygwin, and about 4 times faster on Linux. – jxh Jul 12 '12 at 08:25
  • I tried strtod and it gave me a 25x speedup (which is still slower than on your system). If i don't find anything else, i'll definitely go with strtod. But something safer would be nice ;-) – untraceable Jul 12 '12 at 08:56
  • I've tried g++ on an Ubuntu VirtualBox. I still cannot reproduce your results, but it comes close: strtod has the same performance as on Windows, but stringstream gets 12x faster. – untraceable Jul 12 '12 at 10:39
  • @untraceable: Another point is both my OS's are 64 bit, but I think cygwin only runs in 32 bit mode. – jxh Jul 12 '12 at 22:11
  • @untraceable: I did think of something else you can try. Can you try disabling your virus protection software when you run your test on windows? – jxh Jul 13 '12 at 15:46
  • Virus scanner didn't make any difference. – untraceable Jul 14 '12 at 03:21

Your overhead is in both repeated instantiation of the std::stringstream and in the parsing itself. If your numbers are plain and not using any locale dependent formatting, then I suggest #include <cstdlib> and std::strtod().
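The strtod approach above can be sketched as follows (a minimal sketch only; `parseWithStrtod` is a name invented for illustration, and error handling beyond "stop at the first unparsable token" is omitted):

```cpp
#include <cstdlib>
#include <string>
#include <vector>

// Parse whitespace-separated doubles with std::strtod, advancing a raw
// pointer through the buffer instead of re-scanning from the start.
std::vector<double> parseWithStrtod(const std::string& s) {
    std::vector<double> vals;
    const char* p = s.c_str();   // c_str() is guaranteed NUL-terminated
    char* end = nullptr;
    for (;;) {
        double v = std::strtod(p, &end);
        if (end == p) break;     // no more parsable numbers
        vals.push_back(v);
        p = end;                 // strtod itself skips leading whitespace
    }
    return vals;
}
```

Note that strtod is still locale dependent for the decimal point, so this only sidesteps the stream machinery, not the locale machinery.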

wilx
    or in C++11 you can use [std::stod](http://en.cppreference.com/w/cpp/string/basic_string/stof) – KillianDS Jul 12 '12 at 06:33
    @KillianDS: Unfortunately... it could well be slower. `stod` starts parsing from the beginning of the string, so even though it gives you the index of the first not-parsed character, popping the parsed characters from the string to call `stod` back will cause a huge movement in memory. TL;DR: `stod` is moronic. – Matthieu M. Jul 12 '12 at 07:28
  • @KillianDS: (or perhaps that string not supporting popping in O(1) is the real design issue, but anyway the interaction between the two was not well-thought) – Matthieu M. Jul 12 '12 at 09:49
  • @MatthieuM true, strtod should work on a split version of the string in this case (as it lacks an offset parameter). But that would likely not be a real improvement. – KillianDS Jul 12 '12 at 11:40
  • @KillianDS: well, if you have to pop the characters from the string in an O(N) fashion (at each `double` converted), you are looking at 1 million reallocations and memory transfers. I believe this would outright kill the performance. – Matthieu M. Jul 12 '12 at 11:56
    @KillianDS: `std::stod` should have used iterators for the function parameters. – jxh Jul 12 '12 at 19:47

Converting a string to a double is slow because your Core i5 CPU has no built-in instruction for that conversion.

While that CPU can natively convert between a short, a float, and an int at comparatively high speed, the conversion you describe must be done step by step, analyzing each character and deciding whether it is part of the double and how.

What you're observing is representative of the actual work that needs to be done, considering that each double may look like -.0 or INF or 4E6 or -NAN. It may need to be truncated, it probably needs to be approximated and it may not be a valid double at all.
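To make the per-character work concrete, here is a purely illustrative, stripped-down scanner (`naiveParse` is a name invented here; a conforming parser must also handle INF, NAN, hex floats, the locale decimal point, overflow, and correct rounding, which is where much of the real cost lives):

```cpp
#include <cctype>

// Illustrative only: shows the character-by-character decisions needed
// to recognize [sign] digits [. digits] [e/E [sign] digits].
double naiveParse(const char* s, const char** end) {
    while (std::isspace((unsigned char)*s)) ++s;   // skip whitespace
    bool neg = (*s == '-');                        // optional sign
    if (*s == '+' || *s == '-') ++s;
    double val = 0.0;
    while (std::isdigit((unsigned char)*s))        // integer part
        val = val * 10.0 + (*s++ - '0');
    if (*s == '.') {                               // fraction part
        ++s;
        double scale = 0.1;
        while (std::isdigit((unsigned char)*s)) {
            val += (*s++ - '0') * scale;
            scale *= 0.1;
        }
    }
    if (*s == 'e' || *s == 'E') {                  // exponent part
        ++s;
        bool eneg = (*s == '-');
        if (*s == '+' || *s == '-') ++s;
        int exp = 0;
        while (std::isdigit((unsigned char)*s)) exp = exp * 10 + (*s++ - '0');
        for (int i = 0; i < exp; ++i) val = eneg ? val / 10.0 : val * 10.0;
    }
    if (end) *end = s;
    return neg ? -val : val;
}
```

Even this toy version branches on every character; the standard library adds locale lookups and exact rounding on top of that.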

Drew Dormann
    I understand that the standard library considers lots of special cases. But these are just a few branches at parse time. This does not explain the 25000 cycles. – untraceable Jul 12 '12 at 06:21

This is a pretty involved task for the parser. To parse a double, it has to match either a decimal or a floating point number, then it has to extract that substring and do the actual string-to-double conversion. This means that for each double in your string you are going over its characters at least twice, plus whatever work is needed to advance to the next double. The other part, as mentioned, is that a vector is not the most efficient when it resizes. But mainly, parsing and converting strings is just slow.

sean
  • It's much more than decimal or floating point numbers. There are all sorts of formats the parser has to consider: scientific notation (e.g. `10E-2`), hex in upper or lower case (`0x023Adcc`) and NaN to name a few. – smocking Jul 12 '12 at 06:18

You construct a stringstream object every time you call that function, which is potentially very expensive.

However, we don't have enough information to answer your question. Are you compiling with optimizations turned on all the way? Is your function being inlined, or is there a function call with every invocation?

For a suggestion on how to speed things up, you should consider boost::lexical_cast<double>(str)

David Stone