C++: How to read a lot of data from formatted text files into program?

Question

I'm writing a CFD solver for specific fluid problems. So far the mesh is generated every time the simulation runs, and whenever the geometry or fluid properties change, the program needs to be recompiled.

For small problems with a low number of cells, this works just fine. But for cases with over 1 million cells, where fluid properties need to be changed very often, it is quite inefficient.

Obviously, we need to store simulation setup data in a config file, and geometry information in a formatted mesh file.

Simulation.config file

% Dimension: 2D or 3D
N_Dimension= 2
% Number of fluid phases
N_Phases=  1
% Fluid density (kg/m3)
Density_Phase1= 1000.0
Density_Phase2= 1.0
% Kinematic viscosity (m^2/s)
Viscosity_Phase1=  1e-6
Viscosity_Phase2=  1.48e-05
...

Geometry.mesh file

% Dimension: 2D or 3D
N_Dimension= 2
% Points (index: x, y, z)
N_Points= 100
x0 y0
x1 y1
...
x99 y99
% Faces (Lines in 2D: P1->p2)
N_Faces= 55
0 2
3 4
...
% Cells (polygons in 2D: Cell-Type and Points clock-wise). 6: triangle; 9: quad
N_Cells= 20
9 0 1 6 20
9 1 3 4 7
...
% Boundary Faces (index)
Left_Faces= 4
0
1
2
3
Bottom_Faces= 6
7
8
9
10
11
12
...

It's easy to write config and mesh information to formatted text files. The problem is, how do we read this data efficiently into the program? I wonder if there is an easy-to-use C++ library for this job.

Since your input is text, there is no efficient method (or the very efficient methods don't apply). For example, with the numbers, they have to be parsed and then converted to internal format. Now, if your file was in a binary format, that would be more efficient. — Thomas Matthews, Jun 28 '19 at 19:33
There are many data formats that are human readable, such as XML, HTML and INI. There are libraries to input data in these formats; search the internet. — Thomas Matthews, Jun 28 '19 at 19:34
Please do not reinvent the wheel. Use an existing file format instead of a custom one. Have a look at https://en.m.wikipedia.org/wiki/Polygon_mesh#File_Formats , https://www.hdfgroup.org/solutions/hdf5 and https://en.m.wikipedia.org/wiki/YAML . These formats will supply libraries for storing and reading, and usually put way more effort into efficiency and error tolerance than you ever will. Don't slow yourself down by implementing it your own... — jan.sende, Jun 28 '19 at 19:44
Just use a wavefront .obj and one of the MANY libraries available to read & write that format. Solves your issue and also allows you to use any off-the-shelf CAD program to build your models. — 3Dave, Jun 28 '19 at 20:57
@3Dave Physical boundaries information like inlet, outlet, walls, etc. are very important for simulation, some existing mesh formats (obj, off, stl) do not allow to include this i bet. — KOF, Jun 29 '19 at 12:48
For one version of my GPU FDTD simulator, I stored dielectric values, monitor locations, etc. in textures that were applied to the mesh. During initialization, I voxelized the entire scene to generate the Yee grid using those values, generated monitors, etc. It worked pretty well but required some custom tooling. — 3Dave, Jun 29 '19 at 16:13
Kinda relevant offtopic: I have enjoyed and learned a lot from this talk "Optimising a small real-world C++ application - Hubert Matthews [ACCU 2019]" https://www.youtube.com/watch?v=fDlE93hs_-U — R2RT, Jul 06 '19 at 18:41

score 5 · Answer 1 · answered Jun 28 '19 at 20:19


Well, you can implement your own API based on a finite-element collection, a dictionary, some regex and, finally, apply best practice according to some international standard.

Or you can take a look at these:

GMSH_IO

OpenMesh:

I just used OpenMesh in my last implementation for C++ OpenGL project.


JOSMAR BARBOSA - M4NOV3Y


OpenMesh provides some readers/writers for the obj, off, ply and stl mesh formats. What I need is a simple parser for the config file: name= value entries, and the simple custom mesh format. – KOF Jun 30 '19 at 16:04

einpoklum · Answer 2 · 2019-07-01T13:12:25.770

As a first-iteration solution to just get something tolerable - take @JosmarBarbosa's suggestion and use an established format for your kind of data - which also probably has free, open-source libraries for you to use. One example is OpenMesh developed at RWTH Aachen. It supports:

Representation of arbitrary polygonal (the general case) and pure triangle meshes (providing more efficient, specialized algorithms)

Explicit representation of vertices, halfedges, edges and faces.

Fast neighborhood access, especially the one-ring neighborhood (see below).

[Customization]

But if you really need to speed up your mesh data reading, consider doing the following:

Separate the limited-size meta-data from the larger, unlimited-size mesh data;
Place the limited-size meta-data in a separate file and read it whichever way you like, it doesn't matter.
Arrange the mesh data as several arrays of fixed-size elements or fixed-size structures (e.g. cells, faces, points, etc.).
Store each of the fixed-width arrays of mesh data in its own file, without streaming individual values anywhere: just read or write the array as-is, directly. Here's an example of how a read would look. You'll know the appropriate size of the read either by looking at the file size or the metadata.
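
The "read or write the array as-is" step could look roughly like the following. This is a minimal sketch, not the answerer's code: the `Point2D` record and the `writePoints`/`readPoints` helpers are hypothetical names, and it assumes the file is written and read on machines with the same endianness and struct layout.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <string>
#include <type_traits>
#include <vector>

// Hypothetical fixed-size record for one 2D mesh point.
struct Point2D {
    double x;
    double y;
};
static_assert(std::is_trivially_copyable<Point2D>::value,
              "raw byte I/O requires a trivially copyable type");

// Write the whole array with a single call, no per-value formatting.
bool writePoints(const std::string& path, const std::vector<Point2D>& points) {
    std::ofstream out(path, std::ios::binary);
    if (!out) return false;
    out.write(reinterpret_cast<const char*>(points.data()),
              static_cast<std::streamsize>(points.size() * sizeof(Point2D)));
    return static_cast<bool>(out);
}

// Read the whole array back; the element count is derived from the file size.
bool readPoints(const std::string& path, std::vector<Point2D>& points) {
    std::ifstream in(path, std::ios::binary | std::ios::ate); // open at the end
    if (!in) return false;
    const std::streamoff bytes = in.tellg();
    if (bytes < 0) return false;
    const std::size_t size = static_cast<std::size_t>(bytes);
    if (size % sizeof(Point2D) != 0) return false; // truncated or foreign file
    points.resize(size / sizeof(Point2D));
    in.seekg(0, std::ios::beg);
    in.read(reinterpret_cast<char*>(points.data()),
            static_cast<std::streamsize>(size));
    return static_cast<bool>(in);
}
```

Calling `writePoints("points.bin", pts)` once after mesh generation and `readPoints("points.bin", pts)` on later runs replaces a million formatted conversions with one bulk transfer.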

Finally, you could avoid explicit reading altogether and use memory-mapping for each of the data files. See

fastest technique to read a file into memory?
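
As a rough illustration of the memory-mapping idea, here is a POSIX-only sketch (it will not build on Windows, which has a different API). The `MappedFile` class and the `asDoubles` helper are hypothetical names introduced for this example, and the file is assumed to contain raw doubles written on a machine with the same endianness.

```cpp
#include <cassert>
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Read-only view of a whole file mapped into memory (POSIX only).
class MappedFile {
public:
    explicit MappedFile(const char* path) {
        const int fd = ::open(path, O_RDONLY);
        if (fd < 0) return;
        struct stat st;
        if (::fstat(fd, &st) == 0 && st.st_size > 0) {
            const std::size_t size = static_cast<std::size_t>(st.st_size);
            void* p = ::mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0);
            if (p != MAP_FAILED) {
                data_ = p;
                size_ = size;
            }
        }
        ::close(fd); // the mapping stays valid after the descriptor is closed
    }
    ~MappedFile() {
        if (data_) ::munmap(data_, size_);
    }
    MappedFile(const MappedFile&) = delete;
    MappedFile& operator=(const MappedFile&) = delete;

    const void* data() const { return data_; } // nullptr if mapping failed
    std::size_t size() const { return size_; } // size in bytes

private:
    void* data_ = nullptr;
    std::size_t size_ = 0;
};

// View the mapped bytes as an array of doubles. The mapping is page-aligned,
// so the alignment requirement of double is satisfied.
inline const double* asDoubles(const MappedFile& file, std::size_t& count) {
    assert(file.size() % sizeof(double) == 0);
    count = file.size() / sizeof(double);
    return static_cast<const double*>(file.data());
}
```

With this approach the operating system pages data in on demand, so a solver that touches only part of the mesh never pays for reading the rest.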

Notes/caveats:

If you write and read binary data on systems with different memory layout of certain values (e.g. little-endian vs big-endian) - you'll need to shuffle the bytes around in memory. See also this SO question about endianness.
It might not be worth it to optimize the reading speed as much as possible. You should consider Amdahl's law, and only optimize it to a point where it's no longer a significant fraction of your overall execution time. It's better to lose a few percentage points of execution time, but get human-readable data files which can be used with other tools supporting an established format.
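
To make the endianness caveat concrete, the byte shuffling can be sketched as below. The function names are hypothetical, and the sketch assumes the files are always stored little-endian and that `double` is a 64-bit IEEE 754 value.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// True if this machine stores the least significant byte first.
inline bool hostIsLittleEndian() {
    const std::uint16_t probe = 1;
    unsigned char first = 0;
    std::memcpy(&first, &probe, 1);
    return first == 1;
}

// Reverse the byte order of a 64-bit value.
inline std::uint64_t byteSwap64(std::uint64_t v) {
    std::uint64_t r = 0;
    for (int i = 0; i < 8; ++i) {
        r = (r << 8) | (v & 0xFFu); // move the lowest byte of v to the end of r
        v >>= 8;
    }
    return r;
}

// Decode a double stored in little-endian byte order (assumed file format).
inline double doubleFromLittleEndian(const unsigned char* bytes) {
    static_assert(sizeof(double) == sizeof(std::uint64_t),
                  "assumes a 64-bit double");
    assert(bytes != nullptr);
    std::uint64_t bits = 0;
    std::memcpy(&bits, bytes, sizeof bits);
    if (!hostIsLittleEndian()) bits = byteSwap64(bits); // swap on big-endian hosts
    double value = 0.0;
    std::memcpy(&value, &bits, sizeof value);
    return value;
}
```

On a little-endian host the swap is skipped, so the common case costs only two small copies that compilers typically optimize away.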

score 4 · Answer 3 · 2019-06-30T19:57:21.960

In the following answer I assume:

That if the first character of a line is % then it shall be ignored as a comment.
Any other line is structured exactly as follows: identifier= value.

The code I present will parse a config file following the mentioned assumptions correctly. This is the code (I hope that all needed explanation is in comments):

#include <fstream>          // file IO
#include <iostream>         // console IO
#include <stdexcept>        // exceptions thrown by std::stod
#include <string>           // std::string, std::getline, std::stod
#include <unordered_map>    // hash table that stores the identifiers

int main()
{
    std::unordered_map<std::string, double> identifiers;

    std::string configPath;

    std::cout << "Enter config path: ";
    std::cin >> configPath;

    std::ifstream config(configPath);   // open the specified file
    if (!config.is_open())              // error if the file could not be opened
    {
        std::cerr << "Cannot open config file!\n";
        return -1;
    }

    std::string line;
    while (std::getline(config, line))  // read each line of the file
    {
        if (line.empty() || line[0] == '%') // skip blank lines and comments
            continue;

        const std::size_t identifierLength = line.find('=');
        if (identifierLength == std::string::npos) // not "identifier= value"
            continue;

        try
        {
            // std::stod skips leading whitespace, so any spacing after '=' works
            identifiers.emplace(
                line.substr(0, identifierLength),
                std::stod(line.substr(identifierLength + 1))
            ); // add entry to identifiers
        }
        catch (const std::exception&) // value is missing or not a number
        {
            std::cerr << "Skipping malformed line: " << line << '\n';
        }
    }

    for (const auto& entry : identifiers)
        std::cout << entry.first << " = " << entry.second << '\n';
}

After reading the identifiers you can, of course, do whatever you need to do with them. I just print them as an example to show how to fetch them. For more information about std::unordered_map look here. For a lot of very good information about making parsers have a look here instead.

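If your program also exchanges a lot of data through std::cin and std::cout, insert the following line at the beginning of main: std::ios_base::sync_with_stdio(false). This desynchronizes the C++ standard streams from C stdio and, as a result, makes them faster. Note that it has no effect on file streams such as std::ifstream, so it will not speed up reading the config file itself.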
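
The same line-oriented idea can be extended to the mesh file from the question. The following is only a sketch under assumptions: `Point2D`, `skipComments` and `readPoints` are hypothetical names, and the input is assumed to follow the question's layout, that is, a `N_Points= <n>` header followed by `n` lines of `x y`.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical record for one 2D mesh point.
struct Point2D {
    double x;
    double y;
};

// Skip whitespace and any lines that start with '%'.
inline void skipComments(std::istream& in) {
    while ((in >> std::ws) && in.peek() == '%') {
        std::string ignored;
        std::getline(in, ignored);
    }
}

// Read "N_Points= <n>" followed by n coordinate pairs.
inline bool readPoints(std::istream& in, std::vector<Point2D>& points) {
    std::string key;
    std::size_t count = 0;
    // operator>> stops at whitespace, so "N_Points=" arrives as one token.
    if (!(in >> key >> count) || key != "N_Points=") return false;
    points.clear();
    for (std::size_t i = 0; i < count; ++i) {
        Point2D p;
        if (!(in >> p.x >> p.y)) return false; // too few or malformed values
        points.push_back(p);
    }
    return true;
}
```

Because both helpers take a `std::istream&`, they work with a `std::ifstream` opened on the mesh file as well as with a `std::istringstream` in tests. The faces and cells sections could be read by similar functions.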

score 4 · Answer 4 · answered Jul 01 '19 at 23:08

Assuming:

you don't want to use an existing format for meshes
you don't want to use a generic text format (json, yml, ...)
you don't want a binary format (even though you want something efficient)

In a nutshell, you really need your own text format.

You can use any parser generator to get started. While you could probably parse your config file as it is using only regexps, they can be really limiting in the long run. So I'll suggest a context-free grammar parser, generated with Boost.Spirit X3.

AST

The Abstract Syntax Tree will hold the final result of the parser.

#include <string>
#include <utility>
#include <vector>
#include <variant>

namespace AST {
    using Identifier = std::string; // Variable name.
    using Value = std::variant<int,double>; // Variable value.
    using Assignment = std::pair<Identifier,Value>; // Identifier = Value.
    using Root = std::vector<Assignment>; // Whole file: all assignments.
}
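
Once the parser has filled an AST::Root, the values still have to be pulled out of the variant. One possible way to do that is sketched below; the helpers `asDouble`, `findValue` and `getInt` are hypothetical and not part of the original answer, and the AST aliases are repeated only so the sketch compiles on its own (C++17).

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <variant>
#include <vector>

namespace AST {
    using Identifier = std::string;
    using Value = std::variant<int, double>;
    using Assignment = std::pair<Identifier, Value>;
    using Root = std::vector<Assignment>;
}

// Convert whichever alternative is stored to double.
inline double asDouble(const AST::Value& value) {
    return std::visit([](auto x) { return static_cast<double>(x); }, value);
}

// Linear search by name; returns nullptr if the identifier is absent.
inline const AST::Value* findValue(const AST::Root& root, const std::string& name) {
    for (const auto& assignment : root)
        if (assignment.first == name) return &assignment.second;
    return nullptr;
}

// Fetch a setting that must exist and must be an integer, e.g. N_Dimension.
inline int getInt(const AST::Root& root, const std::string& name) {
    const AST::Value* value = findValue(root, name);
    assert(value != nullptr && std::holds_alternative<int>(*value));
    return std::get<int>(*value);
}
```

A config file has only a handful of entries, so a linear search is adequate here; for many lookups the assignments could be copied into a std::unordered_map instead.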

Parser

Grammar description:

#include <boost/fusion/adapted/std_pair.hpp>
#include <boost/spirit/home/x3.hpp>

namespace x3 = boost::spirit::x3; // alias used by the grammar and by parse()

namespace Parser {
    using namespace x3;

    // Line: Identifier = value
    const x3::rule<class assignment, AST::Assignment> assignment = "assignment";
    // Line: comment
    const x3::rule<class comment> comment = "comment";
    // Variable name
    const x3::rule<class identifier, AST::Identifier> identifier = "identifier";
    // File
    const x3::rule<class root, AST::Root> root = "root";
    // Any valid value in the config file
    const x3::rule<class value, AST::Value> value = "value";

    // Semantic action
    auto emplace_back = [](const auto& ctx) {
        x3::_val(ctx).emplace_back(x3::_attr(ctx));
    };

    // Grammar
    const auto assignment_def = skip(blank)[identifier >> '=' >> value];
    const auto comment_def = '%' >> omit[*(char_ - eol)];
    const auto identifier_def = lexeme[alpha >> +(alnum | char_('_'))];
    const auto root_def = *((comment | assignment[emplace_back]) >> eol) >> omit[*blank];
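    // Note: double_ also accepts integers such as "2", so as written the int_
    // alternative is never selected and every value is stored as a double.
    const auto value_def = double_ | int_;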

    BOOST_SPIRIT_DEFINE(root, assignment, comment, identifier, value);
}

Usage

#include <stdexcept> // std::domain_error

// Takes iterators on string/stream...
// Returns the AST of the input.
template<typename IteratorType>
AST::Root parse(IteratorType& begin, const IteratorType& end) {
    AST::Root result;
    bool parsed = x3::parse(begin, end, Parser::root, result);
    if (!parsed || begin != end) {
        throw std::domain_error("Parser received an invalid input.");
    }
    return result;
}

Live demo

Evolutions

To change where blank spaces are allowed, add/move x3::skip(blank) in the xxxx_def expressions.
Currently the file must end with a newline. Rewriting the root_def expression can fix that.
You'll certainly want to know why the parsing failed on invalid inputs. See the error handling tutorial for that.

You're just a few rules away from parsing more complicated things:

//                                               100              X_n        Y_n
const auto point_def = lit("N_Points") >> '=' >> int_ >> eol >> *(double_ >> double_ >> eol);

score 2 · Answer 5 · answered Jun 30 '19 at 20:29

If you don't need a specific text file format, but have a lot of data and do care about performance, I recommend using an existing data serialization framework instead.

E.g. Google Protocol Buffers allow efficient serialization and deserialization with very little code. The file is binary, so it is typically much smaller than a text file, and binary serialization is much faster than parsing text. It also supports structured data (arrays, nested structs), data versioning, and other goodies.

https://developers.google.com/protocol-buffers/
