0

I have a lexer that consumes a file character by character, looking for tokens. I tried two methods for NextChar(), the first reads directly from ifstream through ifstream::get(ch), and the second loads the whole file into a std::stringstream to avoid disk I/O overhead.

get() method:

inline void Scanner::NextChar()
{
    inputStream.get(unscannedChar);
    currentCol++;

    while (unscannedChar == ' ')
    {
        inputStream.get(unscannedChar);
        currentCol++;
    }

    if (inputStream.eof()) {
        unscannedChar = std::char_traits<char>::eof();
    }

}

stringstream method: while loading the file into stringstream takes no time, indexing is extremely slow.

inline void Scanner::NextChar()
{
    unscannedChar = buffer.str()[counter++];
    currentCol++;

    while (unscannedChar == ' ')
    {
        unscannedChar = buffer.str()[counter++];
        currentCol++;
    }

    
    if (counter > buffer.str().size())
    {
        unscannedChar = std::char_traits<char>::eof();
    }

}

I expected the 2nd method to be much faster, since it's iterating over characters in memory not on disk, but I was wrong, and here are some of my tests:

| tokens    | ifstream::get()   | stringstream::str()[]     |
|--------   |-----------------  |-----------------------    |
| 5         | 0.001 (sec)       | 0.001 (sec)               |
| 800       | 0.002 (sec)       | 0.295 (sec)               |
| 21000     | 0.044 (sec)       | 693.403 (sec)             |    

NextChar() is extremely important for my project, and I need to make it as fast as possible and I would appreciate explaining why am I having the previous results?

Algo
  • 841
  • 1
  • 14
  • 26
  • 4
    `stringstream` is likely creating a defensive copy every time `str` is called. – vandench Sep 22 '21 at 19:41
  • 4
    Doesn’t `buffer.str()`, which you call repeatedly, return a new string of the buffer contents every time it’s called? https://www.cplusplus.com/reference/sstream/stringstream/str/ – Dave Newton Sep 22 '21 at 19:41
  • 1
    I am afraid the loop with `inputStream.get(unscannedChar);` will hang if the space is a last char in the file. – 273K Sep 22 '21 at 19:42
  • You should probably tag it with a platform and specific compiler given the nature of the issue. It would also be helpful if you updated your question to be a [mre][ – Allan Wind Sep 22 '21 at 19:49
  • @AllanWind I don’t think that’d change the contract of `str()a` though. – Dave Newton Sep 22 '21 at 19:51
  • @DaveNewton Fair point. Do we know it's the only issue? – Allan Wind Sep 22 '21 at 19:53
  • @AllanWind Not yet, but it’s certainly the biggest difference between the two methods—creating a new string of an entire buffer in a loop is almost always Very Bad for performance, and creating it once at the beginning of the method would likely demonstrate this immediately. And moving it out of `NextChar` entirely would likely end up being essentially equivalent or faster. – Dave Newton Sep 22 '21 at 19:56
  • No compiler stated, no compiler options stated, no [mcve]. All of these things are missing from the question to come up with a valid conclusion as to the issue. A good percentage of "why is this C++ code slower than..." questions wind up being closed because it is discovered that the compiler options were not for an optimized build, or the code just plain was wrong, the timing code was off, etc. – PaulMcKenzie Sep 22 '21 at 20:02
  • Please remember to include testing code when taking about performance. There are many traps when measurements are done and they are so subtle that even advanced developers make mistake which falsify results. In this case first two comments covered the topic. – Marek R Sep 22 '21 at 20:02
  • 1
    Unrelated: Consider: `do { unscannedChar = inputStream.get(); currentCol++; } while(unscannedChar == ' ');` in your first version. That should remove some code duplication and exit the loop even if the last char is a space. You also get the `unscannedChar = std::char_traits::eof();` for free. – Ted Lyngmo Sep 22 '21 at 20:27
  • If you need indexed access, better to replace `std::stringstream` with just `std::string`. See [How do I read an entire file into a std::string in C++?](https://stackoverflow.com/questions/116038/). – Remy Lebeau Sep 22 '21 at 20:53
  • It's almost impossible on a modern OS to ever "iterate over characters on disk". You're using buffered I/O (ifstream) on top of buffered I/O (the OS device buffer). Adding a third buffer on top wouldn't improve anything even if you implemented it correctly. – Useless Sep 22 '21 at 22:56

2 Answers2

4

std::ifstream is already doing its own internal buffering, so it's not like it has to go out and wait for the hard drive to respond every time you call get(ch); 99.99% of the time, it already has your next character available in its internal read-buffer and just has to do a one-byte copy to hand it over to your code.

Given that, there's no additional speedup to be gained by copying the entire file into your own separate RAM buffer; indeed, doing that is likely to make things slower since it means you can't start parsing the data until after the entire file has been read into RAM (whereas with ifstream's smaller read-ahead buffer, your code can start parsing chars as soon as the first part of the file has been loaded, and parsing can continue to some extent in parallel with disk reads after that)

On top of that, stringstream::str() is returning a string object by-value every time you call it, which can be very expensive if the returned string is large. (i.e. you are making an in-RAM copy of the file's contents, and then throwing it away, for every character you parse!)

Jeremy Friesner
  • 70,199
  • 15
  • 131
  • 234
0

According to my experience, stringstream is slow. See for example:

https://github.com/TheNitesWhoSay/RareCpp/issues/28

Therefore I never use it. If performance is important, consider flex and bison.

https://en.wikipedia.org/wiki/GNU_Bison

For simple format IMHO the fastest way to parse is using C interface.

  • Rather than a blanket never, I use `stringstream` when I think it will simplify the code and won't have a significant performance impact. If profiling says I'm wrong, so be it. It goes into the queue with bugs and other enhancements. – user4581301 Sep 22 '21 at 20:50