
I am trying to find an efficient data structure to store data in C++. Efficiency in both time and storage is important.

I have a set S = {0, 1, 2, 3, ..., N} and multiple levels, say L levels. For each level l ∈ {1, 2, ..., L}, I need a structure, say M, to store 2 subsets of S, which will become the rows and columns of a matrix in the next step. For example:

S = {0, 1, 2, 3, ..., 15} 
L = 2
L1_row = {0, 1, 2, 3, 4, 5}
L1_col = {3, 4, 5, 6, 9}
L2_row = {0, 2, 4, 5, 12, 14}
L2_col = {3, 6, 10, 11, 14, 15}

I have used an unordered_map with integer keys as levels and a pair of unordered_sets for rows and columns, as follows:

unordered_map<int, pair<unordered_set<int>, unordered_set<int>>> M;

However, this is not space-efficient; for example, {3, 4, 5} is recorded 3 times. As S is a large set, M will contain many repeated numbers.

  1. In the next step, I will extract the rows and columns per level from M and create a matrix, so fast access is important.
  2. M may or may not contain all items in S.
  3. M will fill in at 2 stages, first, rows for all levels, and then columns for all levels.
Anna
  • I have many questions... How are you building Lx_row and Lx_columns? How large S can get? Why M may not contain all items in S? – Oussama Ben Ghorbel Feb 15 '22 at 14:52
    If this isn't for school and you're doing linear algebra I'd suggest taking a look at the Eigen library that has already solved these issues for you. Or one of the many other linear algebra packages. – Mgetz Feb 15 '22 at 14:53
    It sounds like you want to both compress data and speed-up access, which is two opposing features, as compression and decompression require processing time. But you should provide an example of the next step as it's not clear what your access pattern is, along with constraints of your sets. There might be some clever maths that can be done there in order to solve this. – Ted Klein Bergman Feb 15 '22 at 14:55
    To save on storage, make the subset its own object {3, 4, 5} its own object and have each level store a pointer to the subset. Also the cost of inserting data into a data structure can be quite different from the cost of accessing data in that data structure (write vs read). You only mentioned "access" which I assume to be "read" - meaning (to me) that you're fine with taking a long time to optimize it before the read accesses begin. Is my assumption correct or is this a dynamically changing structure? – Wyck Feb 15 '22 at 14:56
  • For example, it looks like your sets can vary greatly in size. So are you creating `MxN` matrices where M and N are any natural numbers? Also, have you looked at how frequent these repeats are, and of which size they are? Because you could create a lookup table for the repeated sets, but this might not even be worth it. – Ted Klein Bergman Feb 15 '22 at 14:58
  • Is your matrix sparse or dense? And also your columns and rows? Either the individual set entries can be stored or a bit used for each column/row, whether it is filled. – Sebastian Feb 15 '22 at 16:13
  • Thanks for your comment, but the problem is not with matrices; creating the matrices is the next step. I only need to store 2 subsets of 'S' multiple times and I want to do it efficiently, because these subsets grow in size per level, which causes a lot of repeated storage. – Anna Feb 15 '22 at 18:37
  • Have a look at Suffix trees, e.g. in [this answer](https://stackoverflow.com/questions/9452701/ukkonens-suffix-tree-algorithm-in-plain-english) – TilmannZ Feb 16 '22 at 12:55

1 Answer


That is a tough one. Memory use and efficiency really depend on your concrete data set. If you don't want to store {3,4,5} 3 times, you'll have to create a "token" for it and use that instead.

There are patterns such as

  • flyweight or
  • run length encoding or
  • dictionary

for that. (boost-flyweight or dictionaries for ZIP/7Z and other compression algorithms).

However under some circumstances this can actually use more memory than just repeating the numbers.

Without further knowledge about the concrete use case it is very hard to suggest anything.

Example: Run length encoding is an efficient way to store {3,4,5,6,7...}. Basically you just store the first index and a length, so {3,4,...,12} becomes {{3, 10}}. That's easy to 'decode' and uses a lot less memory than the original sequence, but only if you have many consecutive runs. If you have many short runs it will be worse.

Another example: If you have lots of recurring patterns, say, {2,6,11} appears 23 times, then you could store this pattern as a flyweight or use a dictionary. You replace {2,6,11} with #4. The problem here is identifying patterns and optimizing your dictionary. That is one of the reasons why compressing files (7Zip, bzip etc...) takes longer than uncompressing.

Or you could even store the info in a bit pattern. If your matrix is 1024 columns wide, you could store the columns used in 128 bytes, with "bit set" meaning "use this column". Like a bitmask in image manipulation.

So I guess my answer really is: "it depends".

Hajo Kirchhoff
  • Thank you for answering my question. What I actually mean is that the numbers in 'S' may repeat in 'M' many times, and I was looking for a way to store them more efficiently. There is no special pattern to the repeated items; e.g. 3,4,5 was only an example to show I am storing these numbers multiple times, and in the real data the number of repeated items will definitely increase. – Anna Feb 16 '22 at 09:43
  • The question is, in what way they are repeated. The levels can be similar, then you would store the difference; they could have common sequences; they could have many intervals where every number is in the set or every number is not. It also depends on the percentage of items in the subset. Is it filled 0.1% or 50% or 99.9%? Each space-optimizing implementation has a (possibly small) performance penalty for accessing. It also depends on the way you access: continuously reading, continuously writing, inserting, finding elements. One suitable data structure could be a sorted tree with pointers – Sebastian Feb 16 '22 at 10:11
  • The simplest "optimization" would be to use the smallest datatype possible for your numbers. What is the minimum and what is the maximum number you have? I am guessing the minimum is 0. The maximum might be less than 65536, unless you allow more than 65536 rows/columns. Use `uint16_t` then, which requires two bytes per number, a lot less than size_t (usually 4 or 8). – Hajo Kirchhoff Feb 16 '22 at 12:16