
In my project I work with word vectors as numpy arrays with a dimension of 300. I want to store the processed arrays in a MongoDB database, base64 encoded, because this saves a lot of storage space.

Python code

import base64
import numpy as np

vector = np.zeros(300, dtype=np.float32) # represents some word-vector
vector = base64.b64encode(vector) # base64 encoding
# Saving vector to MongoDB...

In MongoDB it is saved as binary. In C++ I would like to load this binary data as a std::vector&lt;float&gt;. Therefore I have to decode the data first and then load it correctly. I was able to get the binary data into the C++ program with mongocxx and have it as a uint8_t* with a size of 1600, but now I don't know what to do and would be happy if someone could help me. Thank you (:

C++ Code

const bsoncxx::document::element elem_vectors = doc["vectors"];
const bsoncxx::types::b_binary vectors = elem_vectors.get_binary();

const uint32_t b_size = vectors.size; // == 1600
const uint8_t* first = vectors.bytes;

// How To parse this as a std::vector<float> with a size of 300?

Solution

I added these lines to my C++ code and was able to load a vector with 300 elements and all correct values.

    const std::string encoded(reinterpret_cast<const char*>(first), b_size);
    std::string decoded = decodeBase64(encoded);
    std::vector<float> vec(300);
    for (size_t i = 0; i < decoded.size() / sizeof(float); ++i) {
        vec[i] = *(reinterpret_cast<const float*>(decoded.c_str() + i * sizeof(float)));
    }

To mention: thanks to @Holt's info, it is not wise to base64-encode a numpy array and then store it as binary. It is much better to call ".tobytes()" on the numpy array and store that in MongoDB, because it reduces the document size from 1.7kb (base64) to 1.2kb (tobytes()) and also saves computation time because the encoding (and decoding!) no longer has to be done.
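For completeness, here is a minimal sketch of the C++ side under that scheme. It assumes the "vectors" field then holds the raw 1200 bytes from tobytes() (so no base64 step any more) and reuses the b_binary access shown above; parse_raw_vector is just an illustrative name.

#include <cstring>
#include <vector>
#include <bsoncxx/types.hpp>

// Sketch: the binary field holds the raw float32 bytes, so the values can be
// copied straight out of the BSON binary without any decoding.
std::vector<float> parse_raw_vector(const bsoncxx::types::b_binary& bin) {
    std::vector<float> vec(bin.size / sizeof(float)); // 1200 / 4 == 300 elements
    std::memcpy(vec.data(), bin.bytes, bin.size);     // memcpy also avoids alignment issues
    return vec;
}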

Christopher K
  • "I want to store the processed arrays in a mongo database .. because this saves a lot of storage space.". If storage space is at a premium, do not use MongoDB. Its metadata overhead is pretty bad. If you have several thousands of these arrays, don't bother with Base64 and just store them as text. If you have several million, don't use MongoDB. – MSalters Dec 20 '21 at 13:13

1 Answer


Thanks to @Holt for pointing out my mistake.

First, you can't save storage space by using base64 encoding. On the contrary, it wastes storage. An array of 300 floats takes only 300 * 4 = 1200 bytes, while base64 encodes every 3 bytes as 4 characters, so after encoding it takes 1600 bytes! See more about base64 here.

Second, you want to parse the bytes into a vector<float>. You need to decode the bytes if you still use the base64 encoding. I suggest you use some third-party library or try this question. Suppose you already have the decode function.

std::string base64_decode(std::string const& encoded_string); // or something like that.
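If no library is at hand, a minimal decoder along these lines would do; this is only an illustrative sketch for the standard alphabet with '=' padding, not the implementation from the linked question.

#include <cstddef>
#include <stdexcept>
#include <string>

// Minimal base64 decoder sketch: standard alphabet, stops at '=' padding.
std::string base64_decode(std::string const& encoded_string) {
    static const std::string alphabet =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    std::string out;
    unsigned int buffer = 0; // bit accumulator for the incoming 6-bit groups
    int bits = 0;            // number of decoded bits waiting in the buffer
    for (char c : encoded_string) {
        if (c == '=') break; // padding marks the end of the data
        const std::size_t pos = alphabet.find(c);
        if (pos == std::string::npos)
            throw std::invalid_argument("invalid base64 character");
        buffer = ((buffer << 6) | static_cast<unsigned int>(pos)) & 0x3FFFFu;
        bits += 6;
        if (bits >= 8) {
            bits -= 8;
            out.push_back(static_cast<char>((buffer >> bits) & 0xFFu));
        }
    }
    return out;
}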

You need to use reinterpret_cast to get the value.

const std::string encoded(reinterpret_cast<const char*>(first), b_size);
std::string decoded = base64_decode(encoded);
std::vector<float> vec(300);
for (size_t i = 0; i < decoded.size() / sizeof(float); ++i) {
    vec[i] = *(reinterpret_cast<const float*>(decoded.c_str()) + i);
}
Nimrod
  • Strangely enough, the base64 encoding saves a massive amount of memory in MongoDB. From 3.6kb (without encoding) to 1.7kb (with encoding) per document! I tried your solution and was able to parse a vector without runtime errors. But it does not consist of the correct values (from -1 to 1), but of integers like 43.00000 or 55.0000. Do you know why? Thanks for your help anyway – Christopher K Dec 20 '21 at 13:08
  • You need to `reinterpret_cast` before dereferencing, otherwise you're going to assign the `char` value instead of the `float` value. – Holt Dec 20 '21 at 13:08
  • @ChristopherK How do you save the data without encoding? As Nimrod said, your float vector is 32 bits per value, so 1200 bytes; there is no reason MongoDB would use 3600 bytes unless you store them improperly. – Holt Dec 20 '21 at 13:09
  • @Holt: Base-64 is a 6-bit encoding: it uses only 64 out of 256 characters, so each character carries 6 of the 8 bits (3/4). That explains the 4/3 growth. – MSalters Dec 20 '21 at 13:10
  • @MSalters Sorry, I was referring to the "3.6kb without encoding", not the 1.7kb, I put the wrong number there. – Holt Dec 20 '21 at 13:12
  • @Holt: The 3.6 kB suggests 12 bytes/float, which is quite normal for a text representation (which MongoDB typically uses). – MSalters Dec 20 '21 at 13:14
  • @Holt I experimented a little bit. First I saved the word vector as a simple list in MongoDB. So with "vector.to_list()" into a document, which then needs 3.6kb of storage space. Then I encoded it and saved it again in a document and needed only 1.7kb of memory. If this is not the best way to save such data, please correct me! – Christopher K Dec 20 '21 at 13:16
  • @ChristopherK I would try saving `vector.data.tobytes()`, it's a binary array of size 1200. Then I don't know how MongoDB will store it, but if it can store binary data, then it should not take much more than 1200 bytes. – Holt Dec 20 '21 at 13:17
  • Thanks to all of you, I was able to solve this. reinterpret_cast was the solution! @Holt I'll try yours right away, because it can save even more memory - thanks to you too!!! – Christopher K Dec 20 '21 at 13:29