I have a large number of strings and some data associated with each string. For simplicity, let's assume that the data is an int for each string, so I have a std::vector<std::tuple<std::string, int>>. I want to store this data contiguously in memory with a single heap allocation. I will not need to add or delete strings in the future.
A simple example
Constructing a std::string typically requires a heap allocation, and accessing the chars of the std::string requires a dereference. If I have a bunch of strings, I may make better use of memory by storing all of the strings in one std::string and storing each string's starting index and size as separate variables. If I want, I could try to store the starting index and size within the std::string itself.
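As a minimal sketch of that idea, assuming C++17 for std::string_view: the class name StringPool and its members are hypothetical names chosen for illustration. It keeps the characters in one std::string and the (offset, length) pairs in a separate std::vector, so it uses two allocations rather than one.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical helper: every string lives in one shared buffer, and each
// entry records where its characters start and how many there are.
class StringPool {
public:
    explicit StringPool(const std::vector<std::string>& strings) {
        std::size_t total = 0;
        for (const auto& s : strings) total += s.size();
        buffer_.reserve(total);          // one allocation for all characters
        spans_.reserve(strings.size());  // one allocation for the index
        for (const auto& s : strings) {
            spans_.push_back({buffer_.size(), s.size()});
            buffer_ += s;
        }
    }

    std::size_t size() const { return spans_.size(); }

    // Returns a non-owning view into the shared buffer.
    std::string_view operator[](std::size_t i) const {
        assert(i < spans_.size());
        return std::string_view(buffer_).substr(spans_[i].offset, spans_[i].length);
    }

private:
    struct Span {
        std::size_t offset;
        std::size_t length;
    };
    std::string buffer_;
    std::vector<Span> spans_;
};
```

The views returned by operator[] stay valid only as long as the pool itself is alive and unmodified, which fits the case where no strings are added or removed later.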
Back to my problem
One idea I had was to store everything in a std::string or std::vector<char>. Each entry of the std::vector<std::tuple<std::string, int>> would be laid out in memory like this:
- length of the next string (int or size_t)
- sequence of chars representing the string (chars)
- some number of zero chars for correct int alignment (chars)
- data (int)
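The layout above can be sketched as follows. This is one possible reading, not a definitive implementation: the function names Pack, ReadRecord, and AlignUpToInt are hypothetical, the length is stored as an int (so each string's size is assumed to fit in an int), and all ints are written and read with std::memcpy rather than through a cast pointer. C++17 is assumed for std::string_view.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <string_view>
#include <tuple>
#include <utility>
#include <vector>

// Hypothetical helper: round n up to the next multiple of alignof(int).
inline std::size_t AlignUpToInt(std::size_t n) {
    constexpr std::size_t a = alignof(int);
    return (n + a - 1) / a * a;
}

// Packs every (string, int) entry into one buffer as
// [int length][chars][zero padding][int data].
inline std::vector<char> Pack(
    const std::vector<std::tuple<std::string, int>>& entries) {
    std::size_t total = 0;
    for (const auto& e : entries) {
        total += AlignUpToInt(sizeof(int) + std::get<0>(e).size()) + sizeof(int);
    }
    std::vector<char> buffer(total, '\0');  // single allocation, zero-filled
    std::size_t pos = 0;
    for (const auto& e : entries) {
        const std::string& s = std::get<0>(e);
        const int value = std::get<1>(e);
        const int length = static_cast<int>(s.size());  // assumes size fits in int
        std::memcpy(buffer.data() + pos, &length, sizeof(int));
        pos += sizeof(int);
        std::memcpy(buffer.data() + pos, s.data(), s.size());
        pos = AlignUpToInt(pos + s.size());  // skip the zero padding
        std::memcpy(buffer.data() + pos, &value, sizeof(int));
        pos += sizeof(int);
    }
    return buffer;
}

// Reads the record that starts at pos and advances pos to the next record.
inline std::pair<std::string_view, int> ReadRecord(
    const std::vector<char>& buffer, std::size_t& pos) {
    assert(pos + sizeof(int) <= buffer.size());
    int length = 0;
    std::memcpy(&length, buffer.data() + pos, sizeof(int));
    pos += sizeof(int);
    std::string_view text(buffer.data() + pos, static_cast<std::size_t>(length));
    pos = AlignUpToInt(pos + text.size());
    int value = 0;
    std::memcpy(&value, buffer.data() + pos, sizeof(int));
    pos += sizeof(int);
    return {text, value};
}
```

Because every record ends with an int and starts at offset 0 or right after one, each record begins on an int-aligned offset within the buffer. Records can only be walked sequentially in this sketch; random access would need a separate table of record offsets.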
This requires being able to interpret a sequence of chars as an int. There have been questions about this before, but it seems to me that trying to do this can result in undefined behavior. I believe that I can help this slightly by checking the sizeof(int).
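One commonly recommended way to sidestep the alignment and aliasing concerns is to copy the bytes with std::memcpy instead of casting a char pointer to an int pointer. The sketch below assumes C++17; LoadFromBytes and StoreToBytes are hypothetical helper names, not part of any library.

```cpp
#include <cassert>
#include <cstring>
#include <type_traits>

// Hypothetical helper: reads a T out of raw char storage by copying bytes.
// The source pointer does not need to be aligned for T.
template <typename T>
T LoadFromBytes(const char* bytes) {
    static_assert(std::is_trivially_copyable_v<T>,
                  "T must be trivially copyable");
    assert(bytes != nullptr);
    T value;
    std::memcpy(&value, bytes, sizeof(T));
    return value;
}

// Hypothetical helper: writes the object representation of value into
// raw char storage. The caller must provide at least sizeof(T) bytes.
template <typename T>
void StoreToBytes(char* bytes, const T& value) {
    static_assert(std::is_trivially_copyable_v<T>,
                  "T must be trivially copyable");
    assert(bytes != nullptr);
    std::memcpy(bytes, &value, sizeof(T));
}
```

For example, StoreToBytes(buffer + offset, 123456) followed by LoadFromBytes<int>(buffer + offset) round-trips the value even when offset is not a multiple of alignof(int). Compilers usually turn these small fixed-size copies into plain loads and stores, though that is an optimization rather than a guarantee.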
Another option I have is to create a union:
union CharInt {
    char some_chars[sizeof(int)];
    int data;
};
Here, I would need to be careful that the number of chars stored per int is determined at compile time from sizeof(int). I would then store a std::vector<CharInt>. This seems more "C++" than using reinterpret_cast. One downside is that accessing the second char member of a CharInt requires an additional pointer addition (the pointer to the CharInt + 1). This cost still seems small relative to the benefit of making everything contiguous.
Is this the better option? Are there other options available? Are there pitfalls I need to account for when using the union method?
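One pitfall worth keeping in mind: in C++, reading a union member other than the one most recently written is undefined behavior, so each element has to be read as the same kind it was stored as. The sketch below is a hypothetical variant of the CharInt union, assuming C++17 for std::string_view, that ties the char count to sizeof(int) with a loop and checks the size assumption at compile time.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical variant of CharInt: the char member is sized by sizeof(int),
// and the constructor fills it in a loop instead of assuming four chars.
union CharInt {
    explicit CharInt(int i) : an_int{i} {}

    // Copies up to sizeof(int) chars starting at index; any remaining
    // slots keep the '\0' they were initialized with.
    CharInt(std::string_view s, std::size_t index) : some_chars{} {
        assert(index <= s.size());
        for (std::size_t k = 0; k < sizeof(int) && index + k < s.size(); ++k) {
            some_chars[k] = s[index + k];
        }
    }

    char some_chars[sizeof(int)];
    int an_int;
};

// Both members are expected to occupy exactly the same storage.
static_assert(sizeof(CharInt) == sizeof(int),
              "CharInt should be exactly one int wide");
```

The union itself does not record which member is active, so the surrounding container still has to know, from the layout, which elements hold chars and which hold an int.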
Edit:
I want to clarify how CharInt would be used. Here is an example:
#include <iostream>
#include <string>
#include <vector>

class CharIntTest {
public:
    CharIntTest() {
        my_trie.push_back(CharInt{ 42 });
        std::string example_string{ "this is a long string" };
        // Stores the chars of example_string starting at index 5.
        my_trie.push_back(CharInt{ example_string, 5 });
        my_trie.push_back(CharInt{ 106 });
    }
    int GetFirstInt() {
        return my_trie[0].an_int;
    }
    char GetFirstChar() {
        return my_trie[1].some_chars[0];
    }
    char GetSecondChar() {
        return my_trie[1].some_chars[1];
    }
    int GetSecondInt() {
        return my_trie[2].an_int;
    }
private:
    union CharInt {
        // Here I would need to be careful that I only insert sizeof(int) chars;
        // the four initializers below assume sizeof(int) == 4.
        CharInt(const std::string& s, int index)
            : some_chars{ s[index], s[index+1], s[index+2], s[index+3] } {
        }
        CharInt(int i) : an_int{ i } {
        }
        char some_chars[sizeof(int)];
        int an_int;
    };
    std::vector<CharInt> my_trie;
};
Note that I do not access the first or third CharInts as though they were chars, and I do not access the second CharInt as though it were an int. Here is the main:
int main() {
    CharIntTest tester{};
    std::cout << tester.GetFirstInt() << "\n";
    std::cout << tester.GetFirstChar() << "\n";
    std::cout << tester.GetSecondChar() << "\n";
    std::cout << tester.GetSecondInt();
}
which produces the desired output:
42
i
s
106