5

I have a C-string which contains a list of floating numbers separated by commas and spaces. Each pair of numbers is separated by one (or more) spaces and represents a point where the x and y fields are separated by a comma (and optionally by spaces).

" 10,9 2.5, 3   4 ,150.32 "

I need to parse this string in order to fill a list of Point(x, y).
Following is my current implementation:

const char* strPoints = getString();
std::istringstream sstream(strPoints);

float x, y;
char comma;

while (sstream >> x >> comma >> y)
{
   myList.push(Point(x, y));
}

Since I need to parse a lot (up to 500,000) of these strings I'm wondering if there is a faster solution.

Drew Dormann
  • 59,987
  • 13
  • 123
  • 180
Nick
  • 10,309
  • 21
  • 97
  • 201
  • 4
    This is probably better suited for http://codereview.stackexchange.com/ – Baum mit Augen Jan 16 '15 at 19:11
  • 1
    `boost::lexical_cast()` [seems to parse floats several times faster](http://www.boost.org/doc/libs/1_55_0/doc/html/boost_lexical_cast/performance.html) than `stream >>` if you can rule out exceptional data like infinity, NaN, and scientific notation. – Drew Dormann Jan 16 '15 at 19:15
  • 2
    How long does your technique take? Is it more than a minute? How much time are you willing to spend optimizing this to save a minute? – wallyk Jan 16 '15 at 19:18
  • @wallyk A minute?!? It takes about 10 seconds.. too much. I'd like to decrease the time so that it takes 1 second. – Nick Jan 16 '15 at 20:30
  • 1
    Why is 1 second reasonable and 10 seconds "too much"? How are you deciding that? – wallyk Jan 16 '15 at 20:50
  • @wallyk Because you, as user, can wait your application response for about one second, but not much more. The application I'm developing has to render the graphics defined in a xml-based file (SVG). – Nick Jan 16 '15 at 20:53
  • 1
    "Fast" and "stringstream" don't go together. – Neil Kirk Jan 19 '15 at 01:55

3 Answers3

5

Look at Boost Spirit:

It supports NaN, positive and negative infinity just fine. Also it allows you to express the constraining grammar succinctly.

  1. Simple adaptation of the code

    Here is the adapted sample for your grammar:

    struct Point { float x,y; };
    typedef std::vector<Point> data_t;
    
    // And later:
    bool ok = phrase_parse(f,l,*(double_ > ',' > double_), space, data);
    

    The iterators can be any iterators. So you can hook it up with your C string just fine.

    Here's a straight adaptation of the linked benchmark case. This shows you how to parse from any std::istream or directly from a memory mapped file.

    Live On Coliru


  2. Further optimizations (strictly for C strings)

    Here's a version that doesn't need to know the length of the string up front (this is neat because it avoids the strlen call in case you didn't have the length available):

    template <typename OI>
    static inline void parse_points(OI out, char const* it, char const* last = std::numeric_limits<char const*>::max()) {
        namespace qi  = boost::spirit::qi;
        namespace phx = boost::phoenix;
    
        bool ok = qi::phrase_parse(it, last,
                *(qi::double_ >> ',' >> qi::double_) [ *phx::ref(out) = phx::construct<Point>(qi::_1, qi::_2) ],
                qi::space);
    
        if (!ok || !(it == last || *it == '\0')) {
            throw it; // TODO proper error reporting?
        }
    }
    

    Note how I made it take an output iterator so that you get to decide how to accumulate the results. The obvious wrapper to /just/ parse to a vector would be:

    static inline data_t parse_points(char const* szInput) {
        data_t pts;
        parse_points(back_inserter(pts), szInput);
        return pts;
    }
    

    But you can also do different things (like append to an existing container, that could have reserved a known capacity up front etc.). Things like this often allow truly optimized integration in the end.

    Here's that code fully demo-ed in ~30 lines of essential code:

    Live On Coliru

  3. Extra Awesome Bonus

    To show off the flexibility of this parser; if you just wanted to check the input and get a count of the points, you can replace the output iterator with a simple lambda function that increments a counter instead of adds a newly constructed point.

    int main() {
        int count = 0;
        parse_points( " 10,9 2.5, 3   4 ,150.32    ", boost::make_function_output_iterator([&](Point const&){count++;}));
        std::cout << "elements in sample: " << count << "\n";
    }
    

    Live On Coliru

    Since everything is inlined the compiler will notice that the whole Point doesn't need to be constructed here and eliminate that code: http://paste.ubuntu.com/9781055/

    The main function is seen directly invoking the very parser primitives. Handcoding the parser won't get you better tuning here, at least not without a lot of effort.

Community
  • 1
  • 1
sehe
  • 374,641
  • 47
  • 450
  • 633
  • 2
    Nice, I didn't expect Spirit to be that fast. – Philipp Claßen Jan 18 '15 at 23:46
  • @PhilippClaßen To my dismay, I forgot that `qi::rule` does in fact present an erasure point (hence an inlining barrier) to the optimizer. So I simplified my answer a whole lot, and the result is likely a lot faster (as the assembly shows, mostly by being 2x shorter) – sehe Jan 19 '15 at 01:04
3

I got much better performance parsing out the points using a combination of std::find and std::strtof and the code wasn't much more complicated. Here's the test I ran:

#include <iostream>                                                                             
#include <sstream>                                                                              
#include <random>                                                                               
#include <chrono>                                                                               
#include <cctype>                                                                               
#include <algorithm>                                                                            
#include <cstdlib>                                                                              
#include <forward_list>                                                                         

struct Point { float x; float y; };                                                             
using PointList = std::forward_list<Point>;                                                     
using Clock = std::chrono::steady_clock;                                                        
using std::chrono::milliseconds;                                                                

std::string generate_points(int n) {                                                            
  static auto random_generator = std::mt19937{std::random_device{}()};                          
  std::ostringstream oss;                                                                       
  std::uniform_real_distribution<float> distribution(-1, 1);                                    
  for (int i=0; i<n; ++i) {                                                                     
    oss << distribution(random_generator) << " ," << distribution(random_generator) << "\t \n"; 
  }                                                                                             
  return oss.str();                                                                             
}                                                                                               

PointList parse_points1(const char* s) {                                                        
  std::istringstream iss(s);                                                                    
  PointList points;                                                                             
  float x, y;                                                                                   
  char comma;                                                                                   
  while (iss >> x >> comma >> y)                                                                
       points.push_front(Point{x, y});                                                          
  return points;                                                                                
}                                                                                               

inline                                                                                          
std::tuple<Point, const char*> parse_point2(const char* x_first, const char* last) {            
  auto is_whitespace = [](char c) { return std::isspace(c); };                                  
  auto x_last  = std::find(x_first, last, ',');                                                 
  auto y_first = std::find_if_not(std::next(x_last), last, is_whitespace);                      
  auto y_last  = std::find_if(y_first, last, is_whitespace);                                    
  auto x = std::strtof(x_first, (char**)&x_last);                                               
  auto y = std::strtof(y_first, (char**)&y_last);                                               
  auto next_x_first = std::find_if_not(y_last, last, is_whitespace);                            
  return std::make_tuple(Point{x, y}, next_x_first);                                            
}                                                                                               

PointList parse_points2(const char* i, const char* last) {                                      
  PointList points;                                                                             
  Point point;                                                                                  
  while (i != last) {                                                                           
    std::tie(point, i) = parse_point2(i, last);                                                 
    points.push_front(point);                                                                   
  }                                                                                             
  return points;                                                                                
}                                                                                               

int main() {                                                                                    
  auto s = generate_points(500000);                                                             
  auto time0 = Clock::now();                                                                    
  auto points1 = parse_points1(s.c_str());                                                      
  auto time1 = Clock::now();                                                                    
  auto points2 = parse_points2(s.data(), s.data() + s.size());                                  
  auto time2 = Clock::now();                                                                    
  std::cout << "using stringstream: "                                                           
            << std::chrono::duration_cast<milliseconds>(time1 - time0).count() << '\n';         
  std::cout << "using strtof: "                                                                 
            << std::chrono::duration_cast<milliseconds>(time2 - time1).count() << '\n';         
  return 0;                                                                                     
}                                  

outputs:

using stringstream: 1262
using strtof: 120
Ryan Burn
  • 2,126
  • 1
  • 14
  • 35
  • Please add white spaces, such as spaces, tabs, newlines during the generation phase. – Nick Jan 17 '15 at 18:32
  • 2
    The spirit solution [from my answer](http://stackoverflow.com/a/28014785/85371) is faster still. See *[compared live on coliru](http://coliru.stacked-crooked.com/a/bee327c0abf68636)* (even faster with suitably reserved containers instead of using `forward_list`: *[live on coliru](http://coliru.stacked-crooked.com/a/61d1195f5983e878)*). Also, it supports NaN and ±infinity. (It is also a lot more generic, at the cost of taking a lot longer to compile, I guess). – sehe Jan 19 '15 at 01:53
  • @sehe yes your solution is the fastest one, so I accept it as my answer. Anyway I'm able to optimize the mickb solution in order to get an implementation not so much slower than your even without using boost. I think I'l go without boost. +1 to mickb – Nick Jan 19 '15 at 19:56
0

You can first try to disable the sychronization with the C I/O:

std::ios::sync_with_stdio(false);

Source: Using scanf() in C++ programs is faster than using cin?

You can also try to use alternatives to iostream:

I think you should give the sync_with_stdio(false) a try. The other alternatives require more coding, and I'm not sure that you will win much (if any).

Community
  • 1
  • 1
Philipp Claßen
  • 41,306
  • 31
  • 146
  • 239
  • I do not have any significant speedup by using `std::ios::sync_with_stdio(false)` – Nick Jan 16 '15 at 20:33
  • 2
    The OP does not mention files at all. – wallyk Jan 16 '15 at 20:49
  • Instead of reading the whole SVG file into one "const char*" and then parsing it, it should be faster to parse it on the fly. – Philipp Claßen Jan 16 '15 at 21:33
  • @PhilippClaßen where did you get the SVG context from? Anyhoops my answer addresses this too (with a link to benchmarks of `std::istream&` vs. `mmap`) – sehe Jan 18 '15 at 22:35
  • @sehe OP mentioned it in a comment to his question. Yes, I also think that parsing directly from istream or mmap would speed things up. – Philipp Claßen Jan 18 '15 at 23:30
  • @PhilippClaßen ah. That comment was hidden so "find in page" didn't reveal it. Thanks – sehe Jan 18 '15 at 23:32