In continuation of my earlier question about serializing bitsets: I want to avoid rebuilding the bimap repeatedly on the same data, so I want to save the bimap once and load it when needed.

I have chosen boost::bimap to store the data (bitsets) as <key, value> pairs, because it uses hashing and needs O(1) operations for search. The bimap may have 40 million bitset entries, and I want to do the following:

1. Insert bitsets into the bimap in the minimum possible time. The answer to my earlier question takes more time (nearly 5 seconds for 0.25 million bitset entries, which is 5-fold compared to the hash function given under 2). For the same reason, unordered_set_of and unordered_multiset_of are used.

2. I want the bimap to consume as little memory as possible and avoid copying, unlike the following hash function:

namespace std {
    template <typename Block, typename Alloc>
    struct hash<boost::dynamic_bitset<Block, Alloc> > {
        using bitset_type = boost::dynamic_bitset<Block, Alloc>;
        using block_type  = typename bitset_type::block_type;

        size_t operator()(boost::dynamic_bitset<Block, Alloc> const& bs) const {
            thread_local static std::vector<block_type> block_data;
            auto blocks = bs.num_blocks();
            block_data.assign(blocks, 0);
            to_block_range(bs, block_data.begin());
            return boost::hash<std::vector<block_type>>()(block_data);
        }
    };
}

3. O(1) search for a key/value.

4. Load the bimap in a short time. Again, loading the bimap takes much time (nearly 20 seconds for a bimap of 0.25 million entries, size 12 MB).

So I want to achieve 1, 2, 3 and 4 for my already-asked question, for which the answer code by @sehe is shown below.
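One possible direction for requirement 2 (hashing without the temporary vector) is to stream the blocks into the hash as they are produced, instead of materializing them first. Below is only a sketch of that idea in plain standard C++; hash_combining_iterator and hash_blocks are hypothetical names I made up for illustration, and a std::vector<std::uint64_t> stands in for the bitset's block range (with boost::dynamic_bitset one would pass such an output iterator to to_block_range):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <iterator>
#include <vector>

// Output "iterator" that hash-combines every block written through it,
// instead of storing the blocks. No temporary buffer is allocated.
struct hash_combining_iterator {
    using iterator_category = std::output_iterator_tag;
    using value_type        = void;
    using difference_type   = void;
    using pointer           = void;
    using reference         = void;

    std::size_t* seed;

    hash_combining_iterator& operator=(std::uint64_t block) {
        // boost::hash_combine-style mixing step (same magic constant)
        *seed ^= std::hash<std::uint64_t>{}(block) + 0x9e3779b9
                 + (*seed << 6) + (*seed >> 2);
        return *this;
    }
    hash_combining_iterator& operator*()     { return *this; }
    hash_combining_iterator& operator++()    { return *this; }
    hash_combining_iterator  operator++(int) { return *this; }
};

// Hash a block sequence by streaming it through the combining iterator.
std::size_t hash_blocks(const std::vector<std::uint64_t>& blocks) {
    std::size_t seed = 0;
    std::copy(blocks.begin(), blocks.end(), hash_combining_iterator{&seed});
    return seed;
}
```

Equal block sequences produce equal hashes, and nothing is copied; whether this actually beats the thread_local-vector version for your workload would need measuring.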
#include <boost/archive/binary_iarchive.hpp>
#include <boost/archive/binary_oarchive.hpp>
#include <boost/bimap.hpp>
#include <boost/bimap/unordered_multiset_of.hpp>
#include <boost/bimap/unordered_set_of.hpp>
#include <boost/dynamic_bitset/serialization.hpp>
#include <fstream>
#include <iostream>
#include <string>
#include <boost/iostreams/device/back_inserter.hpp>
#include <boost/iostreams/stream_buffer.hpp>
#include <boost/iostreams/stream.hpp>
#include <boost/functional/hash.hpp>
namespace serial_hashing { // see https://stackoverflow.com/questions/30097385/hash-an-arbitrary-precision-value-boostmultiprecisioncpp-int
    namespace io = boost::iostreams;

    struct hash_sink {
        hash_sink(size_t& seed_ref) : _ptr(&seed_ref) {}

        typedef char         char_type;
        typedef io::sink_tag category;

        std::streamsize write(const char* s, std::streamsize n) {
            boost::hash_combine(*_ptr, boost::hash_range(s, s + n));
            return n;
        }
      private:
        size_t* _ptr;
    };

    template <typename T> struct hash_impl {
        size_t operator()(T const& v) const {
            using namespace boost;
            size_t seed = 0;
            {
                iostreams::stream<hash_sink> os(seed);
                archive::binary_oarchive oa(os, archive::no_header | archive::no_codecvt);
                oa << v;
            }
            return seed;
        }
    };
}

namespace std {
    template <typename Block, typename Alloc>
    struct hash<boost::dynamic_bitset<Block, Alloc> >
        : serial_hashing::hash_impl<boost::dynamic_bitset<Block, Alloc> > {};
} // namespace std
namespace bimaps = boost::bimaps;
using Bitset = boost::dynamic_bitset<>;

typedef boost::bimap<
    bimaps::unordered_set_of<Bitset, std::hash<Bitset> >,
    bimaps::unordered_multiset_of<Bitset, std::hash<Bitset> > > Index;

int main() {
    using namespace std::string_literals;

    {
        std::cout << "# Writing binary file ... " << std::endl;
        Index index;
        index.insert({Bitset("10010"s), Bitset("1010110110101010101"s)});

        std::ofstream ofs("binaryfile", std::ios::binary);
        boost::archive::binary_oarchive oa(ofs);
        oa << index;
    }

    {
        std::cout << "# Loading binary file ... " << std::endl;
        std::ifstream ifs("binaryfile", std::ios::binary); // name of loading file
        boost::archive::binary_iarchive ia(ifs);

        Index index;
        ia >> index;
    }
}
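On requirement 4, much of the archive load time is per-entry serialization overhead. One alternative I am considering, sketched here with plain std streams and std::vector<std::uint64_t> standing in for a bitset's block array (save_blocks and load_blocks are illustrative names, not part of any library), is to write the raw block arrays as flat binary records and rebuild the container in bulk on load:

```cpp
#include <cassert>
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// Write all entries as: <entry count> then, per entry, <block count><blocks...>.
void save_blocks(const std::string& path,
                 const std::vector<std::vector<std::uint64_t>>& entries) {
    std::ofstream os(path, std::ios::binary);
    std::uint64_t n = entries.size();
    os.write(reinterpret_cast<const char*>(&n), sizeof n);
    for (const auto& blocks : entries) {
        std::uint64_t m = blocks.size();
        os.write(reinterpret_cast<const char*>(&m), sizeof m);
        os.write(reinterpret_cast<const char*>(blocks.data()),
                 static_cast<std::streamsize>(m * sizeof(std::uint64_t)));
    }
}

// Read the same layout back; each block array is read in one call.
std::vector<std::vector<std::uint64_t>> load_blocks(const std::string& path) {
    std::ifstream is(path, std::ios::binary);
    std::uint64_t n = 0;
    is.read(reinterpret_cast<char*>(&n), sizeof n);
    std::vector<std::vector<std::uint64_t>> entries(n);
    for (auto& blocks : entries) {
        std::uint64_t m = 0;
        is.read(reinterpret_cast<char*>(&m), sizeof m);
        blocks.resize(m);
        is.read(reinterpret_cast<char*>(blocks.data()),
                static_cast<std::streamsize>(m * sizeof(std::uint64_t)));
    }
    return entries;
}
```

The format is not portable across endianness, but for a cache file written and read on the same machine that trade-off may be acceptable; with dynamic_bitset the block arrays would come from to_block_range and go back via from_block_range.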
EDIT

AIM

I have a real-life example: a large string of, say, 2000 million (2 billion) or more characters, and, say, 40-100 million short strings of length 200 or more characters each. My aim is to search for these short strings in the large string. I thought of creating a bimap of bitsets from the large string and then searching for each short string in the bimap. I also chose the unordered variants to make insertion and search very fast, unlike the ordered ones.

Key bitset length: around 3-40 (all combinations at a time).

Value bitset length: around 100-2000 (only one length at a time; for example, if it is 100, all the value entries will be around 90-110 or so).
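For converting the short strings into value bitsets, each character can be packed into a fixed number of bits. The sketch below assumes a 4-letter alphabet such as A, C, G, T (only a guess at the data; a different alphabet would need a different bit width), so a 200-character string becomes a 400-bit value. pack2 is a hypothetical helper, with std::vector<bool> standing in for the dynamic_bitset:

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <vector>

// Pack a string over {A, C, G, T} at 2 bits per character,
// least-significant bit of each code first.
std::vector<bool> pack2(const std::string& s) {
    std::vector<bool> bits;
    bits.reserve(2 * s.size());
    for (char c : s) {
        unsigned code;
        switch (c) {
            case 'A': code = 0; break;
            case 'C': code = 1; break;
            case 'G': code = 2; break;
            case 'T': code = 3; break;
            default: throw std::invalid_argument("unexpected character");
        }
        bits.push_back((code >> 0) & 1);
        bits.push_back((code >> 1) & 1);
    }
    return bits;
}
```

At 2 bits per character, 100 million strings of 200 characters pack into roughly 5 GB of raw bit data before any container overhead, which gives a lower bound on the memory the bimap itself would need.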