I would like to read in a file like this:

13.3027 29.2191 2.39999
13.3606 29.1612 2.39999
13.3586 29.0953 2.46377
13.4192 29.106 2.37817

It has more than 1 million lines.

My current C++ code is:

bool loadCloud(const string &filename, PointCloud<PointXYZ> &cloud)
{
    print_info("\nLoad the Cloud .... (this takes some time!!!) \n");
    ifstream fs;
    fs.open(filename.c_str(), ios::binary);
    if (!fs.is_open() || fs.fail())
    {
        PCL_ERROR(" Could not open file '%s'! Error : %s\n", filename.c_str(), strerror(errno));
        fs.close();
        return (false);
    }

    string line;
    vector<string> st;

    while (!fs.eof())
    {
        getline(fs, line);
        // Ignore empty lines
        if (line == "") 
        {
            std::cout << "  this line is empty...." << std::endl;
            continue;
        }

        // Tokenize the line
        boost::trim(line);
        boost::split(st, line, boost::is_any_of("\t\r "), boost::token_compress_on);

        cloud.push_back(PointXYZ(float(atof(st[0].c_str())), float(atof(st[1].c_str())), float(atof(st[2].c_str()))));
    }
    fs.close();
    std::cout<<"    Size of loaded cloud:   " << cloud.size()<<" points" << std::endl;
    cloud.width = uint32_t(cloud.size()); cloud.height = 1; cloud.is_dense = true;
    return (true);
}

Reading this file currently takes a really long time. I would like to speed this up. Any ideas how to do that?

sqp_125
  • Is the file structure always like above? – dodekja Apr 11 '19 at 06:34
  • No, the file could also contain 6 numbers (xyz, rgb). But the whole file has either 3 numbers per line or 6 numbers per line. – sqp_125 Apr 11 '19 at 06:36
  • Read this please: [Why is iostream::eof inside a loop condition considered wrong?](https://stackoverflow.com/questions/5605125/why-is-iostreameof-inside-a-loop-condition-considered-wrong) – jrok Apr 11 '19 at 06:37
  • Hm but what is my data then in the above example if I do like this: while(fs>>data){...} – sqp_125 Apr 11 '19 at 06:50
  • You can do `while (getline(fs, line))` instead. – jrok Apr 11 '19 at 06:56
  • The first thing you should consider is to read whole blocks of data into your memory and process them. Since your file may be very large, mapping the whole file doesn't seem to work. But you can read large chunks into memory, process them - maybe even in parallel, if that helps. – UniversE Apr 11 '19 at 07:01
  • **Before** attempting to improve your code, profile it, i.e. find out where the time is spent. I guess that the reading takes very little and your analysis most of the time. You may want to use standard methods for reading numbers from a stream, since these methods are likely near-optimal. – Walter Apr 11 '19 at 07:35
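
A rough first step toward the profiling suggested in the comments is to time each stage (reading lines, tokenizing, `push_back`) separately. The helper below is only a sketch: the name `elapsedMs` is hypothetical, and wall-clock timing is a coarse substitute for a real profiler.

```cpp
#include <chrono>

// Runs `work` once and returns the elapsed wall-clock time in milliseconds.
// `elapsedMs` is a hypothetical helper name, not part of any library.
template <typename F>
double elapsedMs(F&& work)
{
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

For example, `elapsedMs([&] { loadCloud(filename, cloud); })` would time one full load, assuming `filename` and `cloud` are already defined as in the question.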

2 Answers

You can just read the numbers instead of the whole line plus parsing, as long as the numbers always come in sets of three.

void readFile(const std::string& fileName, PointCloud<PointXYZ>& cloud)
{
    std::ifstream infile(fileName);

    float vertex[3];
    int coordinateCounter = 0;

    while (infile >> vertex[coordinateCounter])
    {
        coordinateCounter++;
        if (coordinateCounter == 3)
        {
            cloud.push_back(PointXYZ(vertex[0], vertex[1], vertex[2]));
            coordinateCounter = 0;
        }
    }
}
J.R.
  • I am unable to understand how this will handle cases where there are 6 numbers in the line. – anand_v.singh Apr 11 '19 at 06:56
  • Sorry, I just HAVE to downvote this. This is not how proper file handling works. You almost never assume anything about the contents of a file. You always have to check, with maybe one or two exceptions among a million of use cases. – UniversE Apr 11 '19 at 06:58
  • Nevertheless this is really fast! I have now 10 times the speed compared to my code – sqp_125 Apr 11 '19 at 07:02
  • @anand_v.singh It just skips through white space, 6 numbers are not a problem. – J.R. Apr 11 '19 at 07:08
  • @UniversE I have qualified that numbers must come in sets of three. sqp_125 can judge whether that is warranted. Feel free to suggest some additional error checking. – J.R. Apr 11 '19 at 07:10
  • The above code also works for 6 numbers; for me the last 3 numbers are simply skipped :) – sqp_125 Apr 11 '19 at 07:16
  • @J.R. no you have not qualified that the numbers come in set of three, you assume they do. If e.g. in one line there is a fourth number (for whatever reason), it completely corrupts your data. If for whatever reason you have a locale problem and get 42,5 instead of 42.5 as floating point input, your algorithm simply cancels prematurely having the half file read and no one will ever notice. Glad I will not be the person who needs to debug this some day. – UniversE Apr 11 '19 at 07:23
  • @UniversE I probably wasn't very clear before; I have qualified (as in "limited") the algorithm to apply to cases with sets of three coordinates, meaning that you cannot omit the 'z' coordinate, for example, and assume it to be zero; all three coordinates must be present. Then it does not matter whether there are 1, 4, 6, or any other number of coordinates in one line as long as there is a sequence of x,y and z triples. This minimal code sample does not do any error checking nor consider internationalization; I have left this to the OP. After all, the question was about "the fastest way". – J.R. Apr 11 '19 at 19:51

Are you running optimised code? On my machine your code reads a million values in 1800ms.

The trim and the split are probably taking most of the time. If there is white space at the beginning of the string, trim has to copy the whole string contents to erase the first characters. split creates a new string copy for every token; you can optimise this by using string_view to avoid the copies.
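
To illustrate the string_view idea, here is a minimal sketch of a whitespace tokenizer that returns views into the line instead of copies. The function name `splitWhitespace` is hypothetical, and the separator set is assumed to match the one used in the question (space, tab, carriage return).

```cpp
#include <string_view>
#include <vector>

// Splits `line` on spaces, tabs and carriage returns without copying characters.
// The returned views point into `line`, so `line` must outlive them.
std::vector<std::string_view> splitWhitespace(std::string_view line)
{
    constexpr std::string_view separators = " \t\r";
    std::vector<std::string_view> tokens;
    std::size_t pos = line.find_first_not_of(separators);
    while (pos != std::string_view::npos)
    {
        std::size_t end = line.find_first_of(separators, pos);
        if (end == std::string_view::npos)
        {
            end = line.size();
        }
        tokens.push_back(line.substr(pos, end - pos));
        pos = line.find_first_not_of(separators, end);
    }
    return tokens;
}
```

The vector of views still allocates; passing in a vector that is cleared and reused for every line would avoid that as well.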

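As your separators are white space, you can avoid all the copies with code like this (it needs C++17 and the `<charconv>` header; note that `std::from_chars` does not skip leading white space, and its floating-point overloads are missing from older standard libraries):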

bool loadCloud(const string &filename, std::vector<std::array<float, 3>> &cloud)
{
    ifstream fs;
    fs.open(filename.c_str(), ios::binary);
    if (!fs)
    {
        fs.close();
        return false;
    }

    string line;

    while (getline(fs, line))
    {
        // Ignore empty lines
        if (line == "")
        {
            continue;
        }

        const char* first = &line.front();
        const char* last = first + line.length();
        std::array<float, 3> arr;
        for (float& f : arr)
        {
            auto result = std::from_chars(first, last, f);
            if (result.ec != std::errc{})
            {
                return false;
            }
            first = result.ptr;
            while (first != last && isspace(static_cast<unsigned char>(*first)))
            {
                first++;
            }
        }
        if (first != last)
        {
            return false;
        }

        cloud.push_back(arr);
    }
    fs.close();
    return true;
}

On my machine this code runs in 650ms. About 35% of the time is used by getline, 45% by parsing the floats, the remaining 20% is used by push_back.

A few notes:

  1. I've fixed the while(!fs.eof()) issue by checking the state of the stream after calling getline.
  2. I've changed the result to an array, as your example wasn't an MCVE, so I didn't have a definition of PointCloud or PointXYZ. It's possible that these types are the cause of your slowness.
  3. If you know the number of lines (or at least an approximation) in advance, then reserving the size of the vector would improve performance.
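
For note 3, one way to approximate the line count is to divide the file size by a typical line length. This is only a sketch: `estimateLineCount` is a hypothetical helper, and the bytes-per-line figure is an assumption the caller has to supply (the sample lines in the question are about 24 bytes including the newline).

```cpp
#include <cstddef>
#include <cstdint>

// Rough number of lines in a file of `fileBytes` bytes, assuming each line
// takes about `bytesPerLine` bytes (an assumed figure, not a measured one).
// Rounds up by one so a short last line is still counted.
std::size_t estimateLineCount(std::uintmax_t fileBytes, std::size_t bytesPerLine)
{
    if (bytesPerLine == 0)
    {
        return 0;
    }
    return static_cast<std::size_t>(fileBytes / bytesPerLine) + 1;
}
```

With C++17, the file size can come from `std::filesystem::file_size(filename)`, so a call such as `cloud.reserve(estimateLineCount(std::filesystem::file_size(filename), 24));` before the read loop would pre-allocate the vector. An estimate that is too low only costs a few extra reallocations.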
Alan Birtles