I am working on a C++11 application that is supposed to ship as a single executable binary file. Optionally, users can provide their own CSV data files to be used by the application. To simplify things, assume each record has the format key,value\n. I have created a structure such as:

typedef struct Data {
    std::string key;
    std::string value;

    Data(std::string key, std::string value) : key(key), value(value) {}
} Data;

By default, the application should use data defined in a single header file. I've written a simple Python script that parses the default CSV file and writes it into a header file like this:

#ifndef MYPROJECT_DEFAULTDATA
#define MYPROJECT_DEFAULTDATA

#include "../database/DefaultData.h"

namespace defaults {
    std::vector<Data> default_data = {
        Data("SomeKeyA","SomeValueA"),
        Data("SomeKeyB","SomeValueB"),
        Data("SomeKeyC","SomeValueC"),

        /* and on, and on, and on... */

        Data("SomeKeyASFHOIEGEWG","SomeValueASFHOIEGEWG")
    };
}

#endif //MYPROJECT_DEFAULTDATA

The only problem is that the file is big: 116'087 lines (12M), and it will probably be replaced with an even bigger file in the future. When I include it, my IDE tries to parse it and update its indices, which slows everything down to the point where I can hardly write anything.

I'm looking for a way to either:

  1. prevent my IDE (CLion) from parsing it or
  2. make a switch in cmake that would use this file only with release executables or
  3. somehow inject data directly into executable
Jezor
  • Why not just let the exe take a command line parameter which states the file and read it at runtime? – doctorlove Nov 02 '16 at 14:57
  • @doctorlove as I've said "application [...] is supposed to ship as a single executable binary file". It does optionally take a command line parameter, but by default, the file should be inside executable. – Jezor Nov 02 '16 at 15:00
  • And you can't ship the actual "default value" CSV file along with the executable, so it can be read if no other data file is loaded? Then how about creating a single long string containing the actual contents of the file, including it much like your vector, and parsing that string at startup? – Some programmer dude Nov 02 '16 at 15:04
  • So you want to give your users a single executable, but you need to do a different build per user to incorporate their data? (After they have given you the data file)? What problem are you really trying to solve? It is usual to store data in a data file/database (*for a reason*) – doctorlove Nov 02 '16 at 15:07
  • @Someprogrammerdude nope. Well, I could do that but application's performance is important in this case, and static vector initialization seems to be much faster. – Jezor Nov 02 '16 at 15:15
  • @doctorlove no, one build with default file embedded. User optionally specifies their own, external file to be parsed. – Jezor Nov 02 '16 at 15:16
  • Regarding "performance", is the startup-time that important? Or is it more important that it "performs" once it's started? How often will the program be started? Several times a day? Once a day? Once a week? How long will it run once started? A few minutes? Hours? Days? I'm just asking because every other method I know of to "embed" the file into the executable relies on storing the raw unparsed data. Maybe your generated data source file should be built into a separate library, externally, and without the IDE really touching it? – Some programmer dude Nov 02 '16 at 15:20
  • @Someprogrammerdude every bit of resources is important in my case, and everything depends on the situation. I know where you're going, I am aware of the rules of optimization and I rarely break them. The application might be required to run every couple seconds on production servers of other applications, I need to limit CPU and memory usage as much as possible. Separate library unfortunately means I'd have a separate library (build) to manage. – Jezor Nov 02 '16 at 15:37
  • Managing the external "library" is (relatively) easy using the [`add_custom_command`](https://cmake.org/cmake/help/v3.7/command/add_custom_command.html) and [`add_custom_target`](https://cmake.org/cmake/help/v3.7/command/add_custom_target.html) commands. You don't really need a library, just an object file that you add to your main build. As long as you don't open the auto-generated source file, and put it in a separate directory that is excluded from CLion, then you should not have a problem. – Some programmer dude Nov 02 '16 at 15:43
  • @Jezor: If start-up time is as critical as you say, why are you relying on `vector` and `string`? If these are static objects, then you should be building a special data structure that doesn't do any dynamic allocation of any kind. `string` and `vector` dynamically allocate memory at runtime. – Nicol Bolas Nov 02 '16 at 15:45
  • @Someprogrammerdude I'll have to look into it, thank you (: – Jezor Nov 02 '16 at 15:50
  • @NicolBolas hm, isn't C++11 collection initialization allocating memory only once? All variables are read-only, I'm not adding nor deleting anything from the vector, so I guess it should be almost as fast as a plain array? – Jezor Nov 02 '16 at 15:52
  • @Jezor: No. A `vector` is *always* dynamically allocated. `std::string` may or may not, but this determination cannot be made based on whether the string is `const`. It's made based on the particulars of `std::string`'s implementation and the length of the string. Small string optimization will prevent dynamic allocation, but that's based on the size of the string. Indeed, your above code may be doing lots of dynamic allocations. – Nicol Bolas Nov 02 '16 at 15:55
  • @NicolBolas Brace-initialization of a [`std::vector`](http://en.cppreference.com/w/cpp/container/vector) passes a [`std::initializer_list`](http://en.cppreference.com/w/cpp/utility/initializer_list) to the [vector's constructor](http://en.cppreference.com/w/cpp/container/vector/vector). This initializer-list object is created by the compiler at compile-time, and contains the size (number of elements). It would be a very stupid implementation if the vector didn't allocate only once and do in-place construction or even move the data from the initializer-list. – Some programmer dude Nov 02 '16 at 16:01
  • @Someprogrammerdude: I didn't say that `vector` would allocate more than once. Each individual `std::string` *within* that `vector` may have its own allocations, depending on the sizes of the strings and the presence of SSO on that `std::string` implementation. Plus, there's the fact that each `string` construction *also* will copy out of the string literal. – Nicol Bolas Nov 02 '16 at 16:01
  • @Jezor There's a simple way of benchmarking and timing this: Make a simple program, which basically has only the auto-generated initialization code, and a very simple `main` function which selects a random value and prints it. Measure the time it takes to run the program with the output redirected to `/dev/null` (or similar on Windows if that's your platform). Do it for different sizes of the initialization list. Plot in a chart and see if you can find a trend. – Some programmer dude Nov 02 '16 at 16:38
  • See http://stackoverflow.com/questions/22455274/is-there-a-way-to-load-a-binary-file-as-a-const-variable-in-c-at-compile-time – sakra Nov 02 '16 at 17:06

1 Answer

Since your build process already includes a preprocessing step that generates C++ code from a CSV, this should be easy.

Step 1: Put most of the generated data in the .cpp file, not a header.

Step 2: Generate your code so that it doesn't use vector or string.

Here's how to do these:

struct Data
{
    string_view key;
    string_view value;
};

You will need an implementation of string_view or a similar type. While it was standardized in C++17, it doesn't rely on C++17 features.
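As a rough illustration of how little such a type needs, here is a minimal C++11 sketch. The name `StringView` and its members are hypothetical stand-ins for whatever `string_view` implementation you end up using; it is not the standard class and covers only what the generated table requires.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical, minimal stand-in for string_view that compiles as C++11.
// It stores a pointer and a length and never owns or allocates memory.
class StringView {
public:
    constexpr StringView() noexcept : data_(nullptr), size_(0) {}
    constexpr StringView(const char* data, std::size_t size) noexcept
        : data_(data), size_(size) {}

    constexpr const char* data() const noexcept { return data_; }
    constexpr std::size_t size() const noexcept { return size_; }
    constexpr bool empty() const noexcept { return size_ == 0; }

    // Copies the characters into an owning string (this one does allocate).
    std::string to_string() const {
        return size_ != 0 ? std::string(data_, size_) : std::string();
    }

    friend bool operator==(StringView a, StringView b) noexcept {
        return a.size_ == b.size_ &&
               (a.size_ == 0 || std::memcmp(a.data_, b.data_, a.size_) == 0);
    }
    friend bool operator!=(StringView a, StringView b) noexcept {
        return !(a == b);
    }

private:
    const char* data_;
    std::size_t size_;
};
```

Because the two-argument constructor is `constexpr`, a table of such views built from constant pointers and sizes can be constant-initialized, so nothing has to run at startup.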

As for the data structure itself, this is what gets generated in the header:

namespace defaults {
    extern const std::array<Data, {{GENERATED_ARRAY_COUNT}}> default_data;
}

{{GENERATED_ARRAY_COUNT}} is the number of items in the array. That's all the generated header should expose. The generated .cpp file is a bit more complex:

static const char ptr[] =
    "SomeKeyA" "SomeValueA"
    "SomeKeyB" "SomeValueB"
    "SomeKeyC" "SomeValueC"
    ...
    "SomeKeyASFHOIEGEWG" "SomeValueASFHOIEGEWG"
;

namespace defaults 
{
  const std::array<Data, {{GENERATED_ARRAY_COUNT}}> default_data =
  {{  // extra outer braces: std::array wraps a built-in array
      {{ptr+{{GENERATED_OFFSET}}, {{GENERATED_SIZE}}}, {ptr+{{GENERATED_OFFSET}}, {{GENERATED_SIZE}}}},
      {{ptr+{{GENERATED_OFFSET}}, {{GENERATED_SIZE}}}, {ptr+{{GENERATED_OFFSET}}, {{GENERATED_SIZE}}}},
      ...
      {{ptr+{{GENERATED_OFFSET}}, {{GENERATED_SIZE}}}, {ptr+{{GENERATED_OFFSET}}, {{GENERATED_SIZE}}}},
  }};
}

ptr is a string which is a concatenation of all of your individual strings. There is no need to put spaces or \0 characters or anything else between the individual strings. However, if you do need to pass these strings to APIs that take NUL-terminated strings, you'll either have to copy them into a std::string or have the generator put a \0 character after each generated substring.

The point is that ptr should be a single, giant block of character data.

{{GENERATED_OFFSET}} and {{GENERATED_SIZE}} are the offset and the length, within the giant block of character data, of a single substring.
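To make the placeholders concrete, here is a hand-written sketch of what the generator might emit for just the first three rows, with header and .cpp contents merged into one listing. The `StringView` type is a hypothetical stand-in for `string_view`, and the offsets and sizes were worked out by hand for these three rows only.

```cpp
#include <array>
#include <cstddef>
#include <string>

// Hypothetical minimal view type standing in for string_view.
struct StringView {
    const char* data;
    std::size_t size;
    constexpr StringView(const char* d, std::size_t n) : data(d), size(n) {}
    std::string str() const { return std::string(data, size); }
};

struct Data {
    StringView key;
    StringView value;
};

namespace defaults {
    // What the generated header would declare.
    extern const std::array<Data, 3> default_data;
}

// Adjacent string literals are concatenated into one contiguous block.
static const char ptr[] =
    "SomeKeyA" "SomeValueA"
    "SomeKeyB" "SomeValueB"
    "SomeKeyC" "SomeValueC";

// 3 * (8 + 10) characters plus the implicit terminating '\0'.
static_assert(sizeof(ptr) == 54 + 1, "offset table no longer matches the data");

namespace defaults {
    // Each row is {{key offset, key size}, {value offset, value size}}.
    const std::array<Data, 3> default_data = {{
        {{ptr + 0, 8},  {ptr + 8, 10}},
        {{ptr + 18, 8}, {ptr + 26, 10}},
        {{ptr + 36, 8}, {ptr + 44, 10}},
    }};
}
```

With this layout, `defaults::default_data[1].key.str()` would yield `SomeKeyB`. The `static_assert` is an optional extra that the generator could emit to catch a mismatch between the character block and the offset table.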

This method will solve two of your problems. It will be much faster at load time, since it performs zero dynamic allocations. And it puts the generated strings in the .cpp file, thus making your IDE cooperate.
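If you go with the NUL-terminated variant, the generator only has to append \0 to each substring and shift the offsets accordingly. The sketch below shows two rows; `StringView` and the helper `key_equals` are hypothetical names used for illustration, and the offsets were computed by hand.

```cpp
#include <array>
#include <cstddef>
#include <cstring>

// Hypothetical aggregate view type standing in for string_view.
struct StringView {
    const char* data;
    std::size_t size;
};

struct Data {
    StringView key;
    StringView value;
};

// Every substring carries its own '\0', so each one is also a valid C string.
static const char ptr[] =
    "SomeKeyA\0" "SomeValueA\0"
    "SomeKeyB\0" "SomeValueB\0";

namespace defaults {
    // Offsets advance by size + 1 to skip each terminator.
    const std::array<Data, 2> default_data = {{
        {{ptr + 0, 8},  {ptr + 9, 10}},
        {{ptr + 20, 8}, {ptr + 29, 10}},
    }};
}

// Example of handing an entry to a C API that expects a NUL-terminated string.
inline bool key_equals(const Data& entry, const char* c_string) {
    return std::strcmp(entry.key.data, c_string) == 0;
}
```

The cost is one extra byte per substring in the executable; in exchange, `entry.key.data` can be passed straight to functions such as `std::strcmp` or `std::printf` without copying.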

Nicol Bolas