0

I have a quite big collection of data, made by third parties, it contains a Bindata "package" column where plain ASCII is actually stored. I mean, they could have stored the text as string.

I have to export this collection in csv format, which works quite well with mongoexport, but the output contains the base64 encoded values for the "package" column. I need the actual text from that column, not BinData(0,\"NDYuN.....==").

What I've tried so far is update the collection with a new column "rawData", like this:

db.segments.find({"_id" : ObjectId("4fc79525f65181293930070b")}).forEach(function(data) {
  db.segments.update(
     {_id:data._id},
     {$set:{ "rawData" : data.package.toString() }}
  );
});

I've limited the find to just one document until I get it right. Unfortunately toString does not do the magic I expect.

Also, I've tried this:

db.segments.find({"_id" : ObjectId("4fc79525f65181293930070b")}).forEach(function(data){
  data.package = new String(data.package);
  db.segments.save(data); 
});

The result was even worse.

If I read the document with php, $response = $db->execute('return db.segments.findOne()'); then print_r($response) , I can validate that the data is properly stored, as base64.

I couldn't find a solution anywhere, perhaps because nobody ever needed to do something as stupid as this.

ceteras
  • 3,360
  • 2
  • 19
  • 14

1 Answers1

8

The JavaScript BinData object has base64() and hex() methods, which you can use to receive a string in the respective format. As for decoding those values into text strings within JS, you have some options:

According to the BSON spec, binary data fields are composed of a 32-bit integer length, subtype, and byte array. The base64 and hexadecimal representations returned in JS contain the byte array prepended with the integer length. That means after decoding the values, we'll want to strip off the first four bytes. Here's an example using both options:

// https://stackoverflow.com/a/3058974/162228
function decode_base64(s) {
    var e={},i,k,v=[],r='',w=String.fromCharCode,u=0;
    var n=[[65,91],[97,123],[48,58],[43,44],[47,48]];

    for(z in n){for(i=n[z][0];i<n[z][1];i++){v.push(w(i));}}
    for(i=0;i<64;i++){e[v[i]]=i;}
    function a(c){
      if(c<128)r+=w(c);else if(c>=192)u=c;else r+=w(((u&31)<<6)+(c&63));
    }

    for(i=0;i<s.length;i+=72){
        var b=0,c,x,l=0,o=s.substring(i,i+72);
        for(x=0;x<o.length;x++){
            c=e[o.charAt(x)];b=(b<<6)+c;l+=6;
            while(l>=8)a((b>>>(l-=8))%256);
        }
    }
    return r;
}

// https://stackoverflow.com/a/3745677/162228
function hex2a(hex) {
    var str = '';
    for (var i = 0; i < hex.length; i += 2)
        str += String.fromCharCode(parseInt(hex.substr(i, 2), 16));
    return str;
}

db.segments.find().forEach(function(doc){
    print(decode_base64(doc.package.base64()));
});

db.segments.find().forEach(function(doc){
    print(hex2a(doc.package.hex()));
});

And here's a small PHP script I used to insert some fixture data:

<?php

$mongo = new Mongo();
$c = $mongo->selectCollection('test', 'segments');
$c->drop();

$c->save(['package' => new MongoBinData('foo')]);
$c->save(['package' => new MongoBinData('bar')]);

foreach ($c->find() as $doc) {
    printf("%s\n", $doc['bindata']->bin);
}

Depending on the size of your data set, it's likely more reasonable to do the binary field conversion in PHP. If you do want to utilize JavaScript, I'd definitely suggest executing the script via the shell client instead of db.eval(), so you don't lock up the database with a long-running JS function.

FeepingCreature
  • 3,648
  • 2
  • 26
  • 25
jmikola
  • 6,892
  • 1
  • 31
  • 61
  • Thank you, this is a very good answer! I've chosen the php way, but also the other functions are working great. Only thing, I had to remove `.substr(4)` from the usage examples, because my data was "cropped" (the result started with 72247 instead of 46.772247) for both of them. – ceteras Jul 11 '12 at 08:20
  • Interesting. Mind if I ask what version of MongoDB you were using? I had 2.1.0 running for the above. If I insert binary data "foo", and then select it in the JS console, its `hex()` method returns `03000000666f6f`. – jmikola Jul 11 '12 at 15:33
  • Ah, I just tested again with 2.2-rc0 and it requires the `substr(4)` be removed, just as you're seeing for 1.8.2. Looks like that might have been a temporary bug in the unstable release. Thanks! – jmikola Aug 02 '12 at 16:41
  • The `base64()` method is exactly what I needed, thank you. – Vince Bowdren Jul 19 '16 at 10:11