In my project I work with word vectors as numpy arrays of dimension 300. I want to store the processed arrays in MongoDB, base64-encoded, because this saves a lot of storage space.
Python code
import base64
import numpy as np
vector = np.zeros(300, dtype=np.float32) # represents some word-vector
vector = base64.b64encode(vector) # base64 encoding
# Saving vector to MongoDB...
In MongoDB it is saved as binary data. In C++ I would like to load this binary data as a std::vector<float>. Therefore I have to decode the data first and then load it correctly. I was able to get the binary data into the C++ program with mongocxx and have it as a uint8_t* with a size of 1600, but now I don't know what to do and would be happy if someone could help me. Thank you (:
C++ Code
const bsoncxx::document::element elem_vectors = doc["vectors"];
const bsoncxx::types::b_binary vectors = elem_vectors.get_binary();
const uint32_t b_size = vectors.size; // == 1600
const uint8_t* first = vectors.bytes;
// How to parse this as a std::vector<float> with 300 elements?
Solution
I added these lines to my C++ code and was able to load a vector with 300 elements and all correct values.
// Wrap the raw BSON bytes and decode the base64 payload
// (decodeBase64 is the project's own base64 decoding helper).
const std::string encoded(reinterpret_cast<const char*>(first), b_size);
const std::string decoded = decodeBase64(encoded);

// Copy the decoded bytes into the float vector. std::memcpy (from <cstring>)
// avoids the alignment and strict-aliasing issues of casting the buffer to float*.
std::vector<float> vec(decoded.size() / sizeof(float)); // == 300
std::memcpy(vec.data(), decoded.data(), vec.size() * sizeof(float));
One thing to mention: thanks to @Holt's hint, it is not wise to base64-encode a numpy array and then store it as binary. It is much better to call ".tobytes()" on the numpy array and store those raw bytes in MongoDB, because this reduces the document size from 1.7 kB (base64) to 1.2 kB (tobytes()) and also saves computation time, since the encoding (and decoding!) no longer has to be performed.
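For completeness: if the raw ".tobytes()" bytes are stored, the C++ side no longer needs any base64 decoding; the binary payload can be copied straight into the vector. A minimal sketch, assuming the field is still called "vectors" and holds 300 little-endian float32 values (the helper name toFloatVector is just illustrative):
#include <cstring>
#include <vector>
#include <bsoncxx/types.hpp>

// Turn the BSON binary payload into a std::vector<float>.
// Assumes the field holds raw little-endian float32 bytes written by
// numpy's tobytes(), i.e. 300 * 4 = 1200 bytes.
std::vector<float> toFloatVector(const bsoncxx::types::b_binary& bin) {
    std::vector<float> vec(bin.size / sizeof(float));
    std::memcpy(vec.data(), bin.bytes, vec.size() * sizeof(float));
    return vec;
}

// Usage:
// const std::vector<float> vec = toFloatVector(doc["vectors"].get_binary());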