0

I have a large number of strings and some data associated with each string. For simplicity, lets assume that the data is an int for each string. Lets assume I have an std::vector<std::tuple<std::string, int>>. I want to try to store this data continuously in memory with a single heap allocation. I will not need to worry about adding or deleting strings in the future.

A simple example

Constructing an std::string requires a heap allocation, and accessing entry chars of the std::string requires a dereference. If I have a bunch of strings, I may make better use of memory by storing all of the strings in one std::string and storing each string's starting index and size as a separate variable. If I want, I could try to store the starting index and size within the std::string itself.

Back to my problem

One idea I had was to store everything in an std::string or std::vector<char>. Each entry of the std::vector<std::tuple<std::string, int>> would be laid out in memory like this:

  1. length of next string (int or size_t)
  2. sequence of chars representing the string (chars)
  3. some number zero chars for correct int alignment (chars)
  4. data (int)

This requires being able to interpret a sequence of chars as an int. There have been questions about this before, but it seems to me that trying to do this can result in undefined behavior. I believe that I can help this slightly by checking the sizeof(int).

Another option I have is to create a union

union CharInt{
    char[sizeof(int)] some_chars;
    int data;
}

here, I would need to be careful that the number of chars per int used is determined at compile-time based on the result of sizeof(int). I would then store an std::vector<CharInt>. This seems more "C++" than using reinterpret_cast. One downside of this is that accessing the second char member of a CharInt would require an additional pointer addition (the pointer to the CharInt + 1). This cost still seems small relative to the benefit of making everything contiguous.

Is this the better option? Are there other options available? Are there pitfalls I need to account for using the union method?

Edit:

I wanted to provide clarity about how CharInt would be used. I provided an example below:

#include <iostream>
#include <string>
#include <vector>


class CharIntTest {
public:
    CharIntTest() {
        my_trie.push_back(CharInt{ 42 });
        std::string example_string{ "this is a long string" };
        my_trie.push_back(CharInt{ example_string, 5 });
        my_trie.push_back(CharInt{ 106 });
    }

    int GetFirstInt() {
        return my_trie[0].an_int;
    }

    char GetFirstChar() {
        return my_trie[1].some_chars[0];
    }

    char GetSecondChar() {
        return my_trie[1].some_chars[1];
    }

    int GetSecondInt() {
        return my_trie[2].an_int;
    }

private:

    union CharInt {
        // here I would need to be careful that I only insert sizeof(int) number of chars
        CharInt(std::string s, int index) : some_chars{ s[index], s[index+1], s[index+2], s[index+3]} {
        }

        CharInt(int i) : an_int{ i } {
        }

        char some_chars[sizeof(int)];
        int an_int;
    };

    std::vector<CharInt> my_trie;

};

Note that I do not access the first or third CharInts as though they were chars. I do not access the second CharInt as though it were an int. Here is the main:

int main() {
    CharIntTest tester{};

    std::cout << tester.GetFirstInt() << "\n";
    std::cout << tester.GetFirstChar() << "\n";
    std::cout << tester.GetSecondChar() << "\n";
    std::cout << tester.GetSecondInt();
}

which produces the desired output

42
i
s
106
user542101
  • 93
  • 5
  • You might want to look into Polymorphic Memory Resources. They can allow you to write “regular” code and get the data allocated roughly continuously. – Ben Oct 22 '21 at 00:18
  • 1
    you basically want to serialize your data. There are libraries for that, don't reinvent the wheel. – bolov Oct 22 '21 at 00:41
  • 1
    @Ben By "Polymorphic Memory Resources", do you mean this: https://en.cppreference.com/w/cpp/memory/polymorphic_allocator ? I don't know much about allocators, and I'm having trouble seeing as to how using a different allocator would help. – user542101 Oct 22 '21 at 02:18
  • 1
    To supplement bolov's comment, you may give it a try on flatbuffers, it's guaranteed to use continues memory layout. – prehistoricpenguin Oct 22 '21 at 02:27
  • @bolov thanks. I looked at serialization (flatbuffers), and that doesn't look like what I'm looking for. Those look to be about working with files. I'm just trying to create a contiguous trie because I want to learn about tries. I think serialization might be a solution for a problem I don't have. – user542101 Oct 22 '21 at 02:42
  • Type punning via `CharInt` is illegal in C++, you have to `memcpy` (or [`std::bit_cast`](https://en.cppreference.com/w/cpp/numeric/bit_cast) in C++20). – Jarod42 Oct 22 '21 at 08:47
  • `std::vector>` would be contiguous in memory (with you additional big `std::string`). – Jarod42 Oct 22 '21 at 08:52
  • Why do you want a sequence of `char`? – Jarod42 Oct 22 '21 at 08:52
  • @Jarod42 is it type punning? When I build my ```std::vector```, each entry ```CharInt``` will either be 2 ```char```s or an ```int```. If I ```push_back``` a ```CharInt``` that is 2 ```char```s, I will only access it as 2 ```char```s. If I ```push_back``` a ```CharInt``` that is 1 ```int```, I will only access it as 1 ```int```. – user542101 Oct 22 '21 at 11:26
  • `char[sizeof(int)]` seems to indicate it is related to `int`... `CharInt u; u.data = 42; foo(u.some_chars);` is UB. Not sure what you want to do with `CharInt` which isn't UB. `union` is mostly only useful to implement `std::variant`. – Jarod42 Oct 22 '21 at 12:01
  • @Jarod42 I have edited the OP with an example of how I would use ```CharInt```. As far as I can tell, I am not making use of UB. What do you think? – user542101 Oct 22 '21 at 12:45
  • 1
    Ok, no UB here. (assuming `sizeof(int) => 4`, so if you need fixed int, you might use [`std::int32_t`](https://en.cppreference.com/w/cpp/types/integer)). Notice though that you cannot pedantically neither accessing the string data over several `CharInt`, you need to read char 4 by 4 only. – Jarod42 Oct 22 '21 at 12:57
  • @Jarod42 "Notice though that you cannot pedantically neither accessing the string data over several CharInt, you need to read char 4 by 4 only." I'm not sure what you mean by this. – user542101 Oct 22 '21 at 13:36
  • I meant that `CharInt u[2]{{"hell", 0}, {"o!!\0", 0}}; std::cout << u[0].some_chars /*read 8 char*/;` is wrong. – Jarod42 Oct 22 '21 at 14:01
  • @user542101 That's right. This video shows how it's used: https://www.youtube.com/watch?v=q6A7cKFXjY0 It's pretty advanced, and looking at your problem again, I don't know that it's the right approach. – Ben Oct 22 '21 at 19:25
  • Be aware that `std::string` has a small-string optimization, so if you are using short strings, then there is no heap allocation. (Print `&str` and `str.c_str()` to see for yourself.) – Ben Oct 22 '21 at 19:42

0 Answers0