3

Well, I am making a c++ program, that goes through long streams of symbols and I need to store information for further analysis where in the stream appears symbol sequences of certain length. For instance in binary stream

100110010101

I have a sequences for example of length 6 like this:

  • 100110 starting on position 0
  • 001100 starting on position 1
  • 011001 starting on position 2
  • etc.

What I need to store are vectors of all positions where I can find the one certain sequence. So the result should be something like a table, maybe resembling a hash table that look like this:

sequence/ positions

10010101 | 1 13 147 515

01011011 | 67 212 314 571

00101010 | 2 32 148 322 384 419 455

etc.

Now, I figured mapping strings to integers is slow, so because I have information about symbols in the stream upfront, I can use it to map this fixed length sequences to an integer.

The next step was to create a map, that maps these "representing integers" to a corresponding index in the table, where I add next occurence of this sequence. However this is slow, much slower than I can afford. I tried both ordered and unordered map of both std and boost libraries, none having enough efficiency. And I tested it, the map is the real bottleneck here

And here is the loop in pseudocode:

for (int i=seqleng-1;i<stream.size();i++) {
    //compute characteristic value for the sequence by adding one symbol
    charval*=symb_count;
    charval+=sdata[j][i]-'0';
    //sampspacesize is number off all possible sequence with this symbol count and this length
    charval%=sampspacesize;
    map<uint64,uint64>::iterator &it=map.find(charval);
    //if index exists, add starting position of the sequence to the table
    if (it!=map.end()) {
        (table[it->second].add(i-seqleng+1);
    }
    //if current sequence is found for the first time, extend the table and add the index
    else {
        table.add_row();
        map[charval]=table.last_index;
        table[table.last_index].add(i-seqleng+1)
    }
}

So the question is, can I use something better than a map to keep the record of corresponding indeces in the table, or is this the best way possible?

NOTE: I know there is a fast way here, and thats creating a storage large enough for every possible symbol sequence (meaning if I have sequence of length 10 and 4 symbols, I reserve 4^10 slots and can omitt the mapping), but I am going to need to work with lengths and number of symbols that results in reserving amount of memory way beyond the computer's capacity. But the the actual number of used slots will not exceed 100 million (which is guaranteed by the maximal stream length) and that can be stored in a computer just fine.

Please ask anything if there is something unclear, this is my first large question here, so I lack experience to express myself the way others would understand.

TStancek
  • 310
  • 1
  • 12
  • you want to map sequence to positions, or position to sequence? – Richard Hodges Aug 28 '17 at 08:27
  • @RichardHodges There are three types of numbers, one is a number representing the sequence, one is index in the table, one is position. And I want to map the number representing the sequence to the index in the table(and under this index are positions of this sequence). – TStancek Aug 28 '17 at 08:33
  • It seems you have `std::map map` and `MyVector>`. Why not directly `std::map> map` ? – Jarod42 Aug 28 '17 at 08:38
  • @Jarod42 it slows down even more. The more complicated structure in the map, the slower the loading loop is. I went all the way down to reduce it to just loading one symbol at a time and mapping integers to integers, but from here I cant increase the speed, because those are the simplest data types and there is no other way than using map that I know of. – TStancek Aug 28 '17 at 08:46
  • BTW, you may have only one lookup by using `map::insert` instead of `map::find` + `map::operator[]`. – Jarod42 Aug 28 '17 at 08:53
  • 1
    If you are only using the symbols `1` and `0`, then why not store it as an `unordered_map` on eg `uint_8t` where your key is the *number* corresponding to your binary expansion? When reading your source string you just need to read character by character, lshift'ing as you go. – donkopotamus Aug 28 '17 at 09:03
  • @Jarod42 I do not follow ... I need to know whether the sequence was found before and therefore add other position, or add new vector and insert a position. How can I go around this using map and only map::insert?? – TStancek Aug 28 '17 at 09:18
  • Something like: `auto p = map.insert({charval, table.last_index}); if (p.second) { table.add_row(); } table[p.first->second].add(i-seqleng+1);`. See return value of `map::insert`. – Jarod42 Aug 28 '17 at 09:41
  • @Jarod42 can try that, but do you think the lookup will make up for the fact that internally the map must be working and therefore moving a more complicated structure than just an int?? I am not sure, but if you firmly believe so, I'll give it a try. – TStancek Aug 28 '17 at 09:47
  • `map` is node based, so I don't see why size of value really matter. BTW, the single look up can be done with both version of `map`. – Jarod42 Aug 28 '17 at 09:51
  • @Jarod42 OK, I am gonna have to try it with unordered, because ordered is quite slower, I guess it is because it balances itself, moving its internal node values all the time. . – TStancek Aug 28 '17 at 11:02
  • @Jarod42 I have modified your suggestion to suit my program and it resulted in a slight speedup, so thanks. – TStancek Aug 28 '17 at 12:59
  • You might want to turn your final code into answer. – Jarod42 Aug 28 '17 at 13:10
  • @Jarod42 I am sorry, it is not that simple, I made several modifications and my code actually is a bit more complicated than that, I omitted a lot of things, but from what I can tell based on the performance results, I guess the impromevent is done through using unordered_map > and accesing the vector via " vector &b=map[charval]; ", basically avoid branching when working with the map and accessing the vector directly, instead via another container. And it was your idea, so I think you should be the one posting that suggestion as an answer – TStancek Aug 28 '17 at 14:10
  • Mapping strings to integers doesn't have to be expensive, especially if the strings are binary or decimal digits. You can use a sliding-window to shift a bit out the top, and OR in a new bit at the bottom. For decimal digits, store them as 4 bits per digit (BCD), rather than multiplying by 10 and modulo by `10**6` to "shift out" a decimal digit from normal a binary integer. This is what @donkopotamus suggested. It may only be worth it if you use this integer as your own hash function into a `vector[]`, although it will speed up a comparison-based ordered `map`. – Peter Cordes Aug 28 '17 at 20:44

2 Answers2

5

An unordered map with pre-allocated space is usually the fastest way to store any kind of sparse data.

Given that std::string has SSO I can't see why something like this won't be about as fast as it gets:

(I have used an unordered_multimap but I may have misunderstood the requirements)

#include <unordered_map>
#include <string>
#include <iostream>

using sequence = std::string; /// @todo - perhaps replace with something faster if necessary

using sequence_position_map = std::unordered_multimap<sequence, std::size_t>;


int main()
{
    auto constexpr sequence_size = std::size_t(6);
    sequence_position_map sequences;
    std::string input = "11000111010110100011110110111000001111010101010101111010";

    if (sequence_size <= input.size()) {
        sequences.reserve(input.size() - sequence_size);

        auto first = std::size_t(0);
        auto last = input.size();

        while (first + sequence_size < last) {
            sequences.emplace(input.substr(first, sequence_size), first);
            ++first;
        }
    }

    std::cout << "results:\n";
    auto first = sequences.begin();
    auto last = sequences.end();
    while(first != last) {
        auto range = sequences.equal_range(first->first);

        std::cout << "sequence: " << first->first;
        std::cout << " at positions: ";
        const char* sep = "";
        while (first != range.second) {
            std::cout << sep << first->second;
            sep = ", ";
            ++first;
        }
        std::cout << "\n";
    }
}

output:

results:
sequence: 010101 at positions: 38, 40, 42, 44
sequence: 000011 at positions: 30
sequence: 000001 at positions: 29
sequence: 110000 at positions: 27
sequence: 011100 at positions: 25
sequence: 101110 at positions: 24
sequence: 010111 at positions: 46
sequence: 110111 at positions: 23
sequence: 011011 at positions: 22
sequence: 111011 at positions: 19
sequence: 111000 at positions: 26
sequence: 111101 at positions: 18, 34, 49
sequence: 011110 at positions: 17, 33, 48
sequence: 001111 at positions: 16, 32
sequence: 110110 at positions: 20
sequence: 101010 at positions: 37, 39, 41, 43
sequence: 010001 at positions: 13
sequence: 101000 at positions: 12
sequence: 101111 at positions: 47
sequence: 110100 at positions: 11
sequence: 011010 at positions: 10
sequence: 101101 at positions: 9, 21
sequence: 010110 at positions: 8
sequence: 101011 at positions: 7, 45
sequence: 111010 at positions: 5, 35
sequence: 011101 at positions: 4
sequence: 001110 at positions: 3
sequence: 100000 at positions: 28
sequence: 000111 at positions: 2, 15, 31
sequence: 100011 at positions: 1, 14
sequence: 110001 at positions: 0
sequence: 110101 at positions: 6, 36
Richard Hodges
  • 68,278
  • 7
  • 90
  • 142
  • Thanks for the tip, I will try that, see whether I achieve some speed up. – TStancek Aug 28 '17 at 08:57
  • Sorry, I made mistake, it is not. The speed is about the same, I forgot to use multimap and worked only with map. – TStancek Aug 28 '17 at 09:13
  • @TStancek so we could try converting the strings to bits - that would give a marginally faster comparison. – Richard Hodges Aug 28 '17 at 09:58
  • I already did that, well not bits, but ints. The problem is I think in the ordered map, which is quite a lot slower than unordered in this case, that it balances itself all the time. Can I force it to create a balanced tree with a fixed depth, so that every stored element in a node is never moved? – TStancek Aug 28 '17 at 10:57
  • 1
    @TStancek I did some tests a while back to compare performance of unordered and ordered maps. The results were predictable and very much in favour of unordered maps. https://stackoverflow.com/questions/36392583/is-an-unordered-map-really-faster-than-a-map-in-practice – Richard Hodges Aug 28 '17 at 15:38
1

After many suggestions in comments and answer, I tested most of them and picked the fastest possibility, reducing the bottleneck caused by the mapping to almost the same time it ran without the "map"(but producing incorrect data, however I needed to find the minimum speed this can be reduced to)

This was achieved by replacing the unordered_map<uint64,uint> and vector<vector<uint>> with just unordered_map<uint64, vector<uint> >, more precisely boost::unordered_map. I tested it also with unord_map<string,vector<uint>> and it surprised me that it was not that much slower as I expected. However it was slower.

Also, probably due to the fact ordered_map moves nodes to remain a balanced tree in its internal structure, ord_map<uint64, vector<uint>> was a bit slower than ord_map<uint64,uint> together with vector<vector<uint>>. But since unord_map does not move its internal data during computation, seems that it is the fastest possible configuration one can use.

TStancek
  • 310
  • 1
  • 12