2

I don't want to copy the string just to trim it later. I'm parsing a CSV file; my code:

while (std::getline(stream, line))
{
    boost::tokenizer<boost::escaped_list_separator<char>> tok(line);
    std::for_each(tok.begin(), tok.end(), handler);
}

parseCSV(file, [](const std::string& tok)
    {
        std::vector<SpiceSimulation::DataVector*> arrays;
        std::this_thread::sleep_for(std::chrono::milliseconds(500));
        std::cout << "\t-->" << tok << std::endl;
        //std::string cptoken = boost::trim_copy(tok);
        //Read Header Titles
        if(boost::starts_with(tok, "v"))
        {
            std::cout << "START WITH\n";
        }
        
    }); 

My file.csv:

time, vtime2, vtime3, vtime4 ...   

I get results with whitespace. Result: ["time"," vtime2"," vtime3"," vtime4"]

How can I get rid of this whitespace without copying? If I understand correctly, the tokenizer returns each result as a basic_string, so it isn't a copy of the original string.

  • Why don't you want to copy the string (premature optimization alert). Also, so, just don't copy: https://www.boost.org/doc/libs/1_78_0/doc/html/boost/algorithm/trim.html – sehe Jan 04 '22 at 20:56
  • I can't call this function with the token as an argument, because the token is a basic_string – Андрей Петров Jan 05 '22 at 10:34
  • That's perfectly fine. Trim takes any mutable reference to a Sequence: http://coliru.stacked-crooked.com/a/c63ba5193ad0c43e - Of course, the real problem could be (?) when `for_each` doesn't pass the token as a mutable reference. More reasons to not use that interface for your task. However, you never answered why you don't want to copy the string. – sehe Jan 05 '22 at 10:48
  • Only because of performance – Андрей Петров Jan 05 '22 at 11:14
  • Yeah. Prove it [profiling comes before optimizing]. It's not gonna fly. Your bottleneck is 100% tokenizing into strings in the first place, then. Did you see any of my examples? See e.g. http://coliru.stacked-crooked.com/a/dc5f0f1cfd91c456 (from https://stackoverflow.com/a/48997464/85371) or e.g. this trove of good practices: https://stackoverflow.com/a/48533015/85371 – sehe Jan 05 '22 at 11:22
  • Oh, actually somewhat more down the line: https://stackoverflow.com/questions/23699731/simplest-way-to-read-a-csv-file-mapped-to-memory/23703810#23703810 is a splendid example of actually tokenizing the CSV without minimal allocations. – sehe Jan 05 '22 at 11:24

1 Answer

0

The escaped_list_separator has these constructors:

explicit escaped_list_separator(Char  e = '\\',
                                Char c = ',',Char  q = '\"')
  : escape_(1,e), c_(1,c), quote_(1,q), last_(false) { }

escaped_list_separator(string_type e, string_type c, string_type q)
  : escape_(e), c_(c), quote_(q), last_(false) { }

You can pass those:

    boost::escaped_list_separator<char> tf("\\", ", ", "\"");
    boost::tokenizer<boost::escaped_list_separator<char>> tok(line, tf);
    std::for_each(tok.begin(), tok.end(), handler);

But it doesn't exactly do what you expect:

Line: "time, vtime2, vtime3, vtime4 ...   "
        -->"time"
        -->""
        -->"vtime2"
START WITH
        -->""
        -->"vtime3"
START WITH
        -->""
        -->"vtime4"
START WITH
        -->"..."
        -->""
        -->""
        -->""

I would do this another way. Parsing != tokenizing. See e.g. https://stackoverflow.com/search?tab=newest&q=user%3a85371%20csv%20parser

sehe