25

I have an enormous quantity of .bson files from a MongoDB dump. I am running bsondump on the command line and piping its output to Python on stdin. This converts the BSON to 'JSON', but what arrives is a plain string, and seemingly not legal JSON.

For example an incoming line looks like this:

{ "_id" : ObjectId( "4d9b642b832a4c4fb2000000" ),
  "acted_at" : Date( 1302014955933 ),
  "created_at" : Date( 1302014955933 ),
  "updated_at" : Date( 1302014955933 ),
  "_platform_id" : 3,
  "guid" : 72106535190265857 }

Which I believe is Mongo Extended JSON.

When I read in such a line and do:

json_line = json.dumps(line)

I get:

"{ \"_id\" : ObjectId( \"4d9b642b832a4c4fb2000000\" ),
\"acted_at\" : Date( 1302014955933 ),
\"created_at\" : Date( 1302014955933 ),
\"updated_at\" : Date( 1302014955933 ),
\"_platform_id\" : 3,
\"guid\" : 72106535190265857 }\n"

Which is still <type 'str'>.

I have also tried

json_line = json.dumps(line, default=json_util.default)

(see pymongo json_util; spam detection prevents a third link), which seems to output the same as dumps above. loads gives an error:

json_line = json.loads(line, object_hook=json_util.object_hook)
ValueError: No JSON object could be decoded

So, how can I transform the string of TenGen JSON into parseable JSON? (the end goal is to stream tab separated data to another database)

Peter Nachbaur

4 Answers

19

What you have is a dump in Mongo Extended JSON in TenGen mode (see here). Some possible ways to go:

  1. If you can dump again, use Strict output mode through the MongoDB REST API. That should give you real JSON instead of what you have now.

  2. Use bson from http://pypi.python.org/pypi/bson/ to read the BSON you already have into Python data structures and then do whatever processing you need on those (possibly outputting JSON).

  3. Use the MongoDB Python bindings to connect to the database to get the data into Python, and then do whatever processing you need. (If needed, you could set up a local MongoDB instance and import your dumped files into that.)

  4. Convert the Mongo Extended JSON from TenGen mode to Strict mode. You could develop a separate filter to do it (read from stdin, replace TenGen structures with Strict structures, and output the result on stdout) or you could do it as you process the input.

Here's an example using Python and regular expressions:

import json, re
from bson import json_util

# open in text mode: on Python 3 the str patterns below cannot be applied to bytes
with open("data.tengenjson", "r") as f:
    # read the entire input; in a real application,
    # you would want to read a chunk at a time
    bsondata = f.read()

    # convert the TenGen JSON to Strict JSON
    # here, I just convert the ObjectId and Date structures,
    # but it's easy to extend to cover all structures listed at
    # http://www.mongodb.org/display/DOCS/Mongo+Extended+JSON
    jsondata = re.sub(r'ObjectId\s*\(\s*\"(\S+)\"\s*\)',
                      r'{"$oid": "\1"}',
                      bsondata)
    jsondata = re.sub(r'Date\s*\(\s*(\S+)\s*\)',
                      r'{"$date": \1}',
                      jsondata)

    # now we can parse this as JSON, and use MongoDB's object_hook
    # function to get rich Python data structures inside a dictionary
    data = json.loads(jsondata, object_hook=json_util.object_hook)

    # just print the output for demonstration, along with the type
    print(data)
    print(type(data))

    # serialise to JSON and print
    print(json_util.dumps(data))
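
The example above reads the whole file at once. A line-at-a-time variant might look like the following sketch. It assumes bsondump emits one document per line and uses only the standard library, so `$oid` and `$date` stay as plain dicts rather than going through `json_util.object_hook`. The names `tengen_to_strict` and `parse_documents` are made up for illustration, and the patterns would also rewrite a string value that happens to contain text like `Date( 1 )`.

```python
import json
import re

# Patterns for the two TenGen constructs that appear in the question's sample.
OBJECT_ID = re.compile(r'ObjectId\s*\(\s*"([0-9a-fA-F]+)"\s*\)')
DATE = re.compile(r'Date\s*\(\s*(-?\d+)\s*\)')


def tengen_to_strict(text):
    """Rewrite ObjectId(...) and Date(...) into Strict-mode JSON objects."""
    text = OBJECT_ID.sub(r'{"$oid": "\1"}', text)
    return DATE.sub(r'{"$date": \1}', text)


def parse_documents(lines):
    """Yield one dict per non-empty line (assumes one document per line)."""
    for line in lines:
        if line.strip():
            yield json.loads(tengen_to_strict(line))
```

Passing `sys.stdin` as `lines` would let this sit at the end of a `bsondump ... | python script.py` pipeline without holding the whole dump in memory.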

Depending on your goal, one of these should be a reasonable starting point.

Fabian Fagerholm
  • Yes, I linked to that Extended JSON page in my question. I have tried the BSON library and it didn't accomplish my goal. bsondump was the only thing I could get to work, but it is giving me a string. Dumping or reloading the data are not viable options. – Peter Nachbaur Aug 09 '12 at 15:34
  • @PeterNachbaur: I added an option to my answer to show how you could convert the TenGen JSON format to Strict JSON. Is that more what you had in mind? – Fabian Fagerholm Aug 10 '12 at 08:58
  • Thanks for continuing to help :) now the loads works. At the end I assume you mean json.dumps not json_util.dumps (latter doesn't seem to exist) but that doesn't work. However, I'm not sure I need it now that I have a dict. Cheers! – Peter Nachbaur Aug 10 '12 at 14:47
  • @PeterNachbaur: No problem, it was fun to think about this! :) Strange, I have a bson.json_util.dumps function. But maybe it only exists in the development branch of the mongo-python code. Anyway, you've got the dict, so you can do whatever you want now ;) Good luck! – Fabian Fagerholm Aug 10 '12 at 18:33
  • Nice approach. regex :) – igorkf Nov 07 '19 at 12:41
8

Loading an entire BSON file into Python memory is expensive.

If you want to stream the documents in rather than reading the whole file and decoding everything at once, you can try this library:

https://github.com/bauman/python-bson-streaming

from sys import argv

from bsonstream import KeyValueBSONInput

for path in argv[1:]:
    with open(path, 'rb') as f:
        # remove fast_string_prematch if you do not need the fast string match
        stream = KeyValueBSONInput(fh=f, fast_string_prematch="something")
        for doc_id, dict_data in stream:
            if doc_id:
                print(dict_data)  # replace with your own processing of dict_data
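
For readers who would rather not add a dependency, the idea behind streaming can be sketched with the standard library alone. This is not how the library above is implemented. It is only an illustration that relies on the BSON layout, in which every document starts with a little-endian int32 giving its total size (including those four bytes). The name `iter_raw_bson` is hypothetical.

```python
import struct


def iter_raw_bson(fh):
    """Yield each BSON document in a binary file object as raw bytes."""
    while True:
        header = fh.read(4)
        if not header:
            return  # clean end of file
        if len(header) < 4:
            raise ValueError("truncated BSON length header")
        (size,) = struct.unpack("<i", header)
        if size < 5:  # smallest valid document: 4-byte size + terminator
            raise ValueError("invalid BSON document size: %d" % size)
        body = fh.read(size - 4)
        if len(body) != size - 4:
            raise ValueError("truncated BSON document")
        yield header + body
```

Each yielded chunk holds exactly one document, so it could then be handed to a decoder such as `bson.decode_all` from the bson package without ever reading the full dump.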
bauman.space
7

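You can decode the documents in the BSON file like this:

>>> import bson
>>> with open('file.bson', 'rb') as f:
...     bs = f.read()
...
>>> for valid_dict in bson.decode_all(bs):
...     print(valid_dict)
...

Note that decode_all expects the file's bytes, not the file object itself. Each valid_dict element will be a valid Python dict that you can convert to JSON.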
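
Since the question's stated end goal is tab-separated output, here is a rough sketch of turning such dicts into TSV rows with the standard library. The column list `FIELDS` and the helper names `to_cell` and `write_tsv` are assumptions for illustration. Field names are taken from the sample document in the question, and `str()` is assumed to be an acceptable rendering for non-date values such as ObjectIds.

```python
import csv
from datetime import datetime

# Hypothetical column order, using field names from the question's sample.
FIELDS = ["_id", "acted_at", "_platform_id", "guid"]


def to_cell(value):
    """Render one decoded value as text for a TSV cell."""
    if isinstance(value, datetime):
        return value.isoformat()
    return "" if value is None else str(value)


def write_tsv(docs, out, fields=FIELDS):
    """Write one tab-separated row per document to the file-like object out."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for doc in docs:
        writer.writerow([to_cell(doc.get(name)) for name in fields])
```

Calling `write_tsv(docs, sys.stdout)` with any iterable of dicts would stream rows to stdout; missing fields come out as empty cells.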

Emily S
  • That for loop doesn't work :( Traceback (most recent call last): File "", line 1, in File "/usr/local/lib/python2.7/dist-packages/bson/__init__.py", line 473, in decode_all end = len(data) - 1 TypeError: object of type 'file' has no len() – Peter Nachbaur Aug 09 '12 at 16:26
-1

You can strip out the data types and get strict JSON with a regex:

import json
import re

# Generator that converts each line of the file into a dict.
def readBsonFile(filename):
    with open(filename, "r") as data_in:
        for line in data_in:
            # convert the TenGen JSON to Strict JSON
            jsondata = re.sub(r'\:\s*\S+\s*\(\s*(\S+)\s*\)',
                              r':\1',
                              line)

            # parse as JSON
            line_out = json.loads(jsondata)

            yield line_out
Maviles