14

I want to read a BSON-format MongoDB dump in Python and process the data. I am using the Python bson package (which I'd prefer over adding a pymongo dependency), but its documentation doesn't explain how to read from a file.

This is what I'm trying:

bson_file = open('statistics.bson', 'rb')
b = bson.loads(bson_file)
print b[0]

But I get:

Traceback (most recent call last):
  File "test.py", line 11, in <module>
    b = bson.loads(bson_file)
  File "/Library/Python/2.7/site-packages/bson/__init__.py", line 75, in loads
    return decode_document(data, 0)[1]
  File "/Library/Python/2.7/site-packages/bson/codec.py", line 235, in decode_document
    length = struct.unpack("<i", data[base:base + 4])[0]
TypeError: 'file' object has no attribute '__getitem__'

What am I doing wrong?

Richard

3 Answers

15

I found this worked for me with a MongoDB 2.4 BSON file and PyMongo's 'bson' module (the one installed by `pip install pymongo`, not the separate bson package):

import bson
with open('survey.bson','rb') as f:
    data = bson.decode_all(f.read())

That returns a list of dictionaries, one for each document stored in that MongoDB collection.

The raw bytes returned by f.read() (stored here in a variable named rawdata) look like this for a BSON file:

>>> rawdata[:100]
'\x04\x01\x00\x00\x12_id\x00\x01\x00\x00\x00\x00\x00\x00\x00\x02_type\x00\x07\x00\x00\x00simple\x00\tchanged\x00\xd0\xbb\xb2\x9eI\x01\x00\x00\tcreated\x00\xd0L\xdcfI\x01\x00\x00\x02description\x00\x14\x00\x00\x00testing the bu'        
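To make that output easier to read, here is a minimal sketch that unpacks the first few bytes of the sample using only the standard library. It assumes the standard BSON layout (a little-endian int32 document length, then a one-byte type tag and a NUL-terminated element name); the variable names are illustrative and not part of any bson package.

```python
import struct

# First bytes of the sample shown above (copied from the rawdata output).
header = b'\x04\x01\x00\x00\x12_id\x00'

# A BSON document starts with its total size as a little-endian int32.
(doc_length,) = struct.unpack('<i', header[:4])

# Next comes the type tag of the first element (0x12 is a 64-bit integer),
# followed by the element name as a NUL-terminated string.
type_tag = header[4]
name = header[5:header.index(b'\x00', 5)].decode('utf-8')

print(doc_length, hex(type_tag), name)
```

So the first document in that dump is 260 bytes long and begins with an integer field named _id.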
samkass
Marc Maxmeister
  • but the approved answer is a better way to go – Marc Maxmeister Feb 13 '17 at 13:33
  • 4
    In order to use the code from this answer, you need to run `pip install pymongo` rather than `pip install bson` (also published as `pip install pybson`). Both provide a module that is imported with `import bson`. See https://github.com/py-bson/bson/issues/70 – questionto42 Jun 20 '20 at 12:10
  • @Marc Maxmeister So far, the other answer's example does not help me with the quite normal case of a MongoDB BSON dump file that spans multiple lines, and as long as that is not answered at https://stackoverflow.com/questions/58474479/bson-file-to-pandas-dataframe, the other answer is not simply a better way to go. – questionto42 Jun 20 '20 at 12:14
  • @Lorenz I consider `bson.loads` better because that function was designed for this exact use case. My answer deals with the raw data, and I suppose it may be more flexible if you're loading something that is not properly encoded. If you need to load each line in the file separately, you might try `bson.loads` on each line instead? – Marc Maxmeister Jun 23 '20 at 02:46
  • **pybson**'s `bson.loads` seems to be complicating things as soon as you have multiple lines. See the comments on the pybson answer. At least from what I have experienced in practice. Your answer with the **pymongo**'s `import bson` works fine. – questionto42 Jul 01 '20 at 21:27
  • I do not understand why my edit of the answer was not accepted. "Trusted SO members" had to decide this, and they ignored the edit. bson module exists both in pymongo package and in bson package. The current expression "python's bson module" of this answer here has cost me several hours to find out that this answer is not based on the bson package as the other answer is. (That is why I have asked bson to also offer its package under another name, which is now done with pybson, see again https://github.com/py-bson/bson/issues/70) – questionto42 Jul 04 '20 at 09:22
9

The documentation states:

> help(bson.loads)
Given a BSON string, outputs a dict.

You need to pass a string. For example:

> b = bson.loads(bson_file.read())
njzk2
  • 1
    I need this for the quite normal case of a BSON file with multiple lines, where I therefore use .readlines() instead of .read(). This throws an error. Ideally, please answer at https://stackoverflow.com/questions/58474479/bson-file-to-pandas-dataframe – questionto42 Jun 20 '20 at 12:32
  • do you mean multiple bson objects in the file, one per line? would `b = [bson.loads(line) for line in bson_file.readlines()]` work in that case? – njzk2 Jun 20 '20 at 18:27
  • Throws error: `[pybson.loads(line) for line in open(filename,'rb').readlines()] Traceback (most recent call last): File "", line 1, in File "", line 1, in File "C:\Users\USER\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pybson\__init__.py", line 47, in loads return decode_document(data, 0)[1] File "C:\Users\USER\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pybson\codec.py", line 277, in decode_document if data[end_point - 1] not in ('\0', 0): IndexError: index out of range` – questionto42 Jun 24 '20 at 13:15
  • I have installed and imported it with `import pybson`, as that can be differentiated from pymongo's `import bson`, see https://github.com/py-bson/bson/issues/70. Your `import bson` is the same as long as you do not have pymongo installed (as pymongo's `import bson` dominates). – questionto42 Jun 24 '20 at 13:26
  • readlines() does not split the string correctly, what is needed is `with open(filename,'rb') as f: b = f.read() listStrByte = re.split(rb'(?!\A)(?=\\!)', b) print(len(listStrByte)) listBs = [] for i in listStrByte[:2]: listBs.append(pybson.loads(i)) ` See https://stackoverflow.com/questions/62591863/split-a-string-and-keep-the-delimiters-as-part-of-the-split-string-chunks-not-a for details about this regex split. I stopped the loop at [:2], take this constraint away after tests. My limited pc crashed at [:9], thus I use the easier pymongo now, see the pymongo answer. – questionto42 Jul 01 '20 at 21:18
  • 2
    This will give an attribute error. bson doesn't support loads. I tried running it and got an error like: AttributeError: module 'bson' has no attribute 'loads' – Puja Bhattacharya Aug 13 '20 at 12:14
  • But how do you define the bson_file? – seeker_after_truth May 21 '22 at 15:42
  • 1
    @seeker_after_truth that's just any file-like object – njzk2 May 30 '22 at 20:34
  • @njzk2, thank you, but what actually is a "file-like object", and where can I go to learn more about it? – seeker_after_truth Sep 19 '22 at 16:47
  • @PujaBhattacharya I'm getting the same error (AttributeError: module 'bson' has no attribute 'loads'). Is the problem that there are 2 bson packages and we are using a different one from @njzk2? – seeker_after_truth Sep 19 '22 at 16:50
  • https://github.com/py-bson/bson not much ambiguity there – njzk2 Sep 19 '22 at 17:07
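On the multi-document case raised in the comments: a BSON dump is binary, not line-oriented, so a newline byte (0x0a) can appear anywhere inside a document and readlines() cuts documents at arbitrary points. What follows is a minimal sketch, assuming the file is simply a sequence of documents that each begin with their own little-endian int32 length. It splits the raw bytes on those length prefixes using only the standard library. The function name split_bson_documents is hypothetical and not part of any bson package.

```python
import struct

def split_bson_documents(data):
    """Yield the raw bytes of each document in a concatenated BSON dump."""
    offset = 0
    while offset < len(data):
        if offset + 4 > len(data):
            raise ValueError("truncated length header at offset %d" % offset)
        # Each document starts with its total size (header included).
        (length,) = struct.unpack_from('<i', data, offset)
        # 5 bytes is the smallest valid document: 4-byte length + terminator.
        if length < 5 or offset + length > len(data):
            raise ValueError("invalid document length at offset %d" % offset)
        yield data[offset:offset + length]
        offset += length

# Two hand-built documents: {"a": 1} (int32 value) and the empty document {}.
doc_a = b'\x0c\x00\x00\x00\x10a\x00\x01\x00\x00\x00\x00'
doc_empty = b'\x05\x00\x00\x00\x00'

chunks = list(split_bson_documents(doc_a + doc_empty))
print(len(chunks))
```

Each chunk is then one complete document that could be handed to a single-document decoder such as the `bson.loads` call from the answer above (assuming the py-bson package is the one installed).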
2

loads expects a string (that's what the 's' stands for), not a file. Try reading from the file, and passing the result to loads.
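A minimal illustration of the difference, using only the standard library: io.BytesIO stands in for the opened file (a "file-like object" is anything with a read() method), and the slice mimics what the decoder in the traceback does with its argument. The names here are illustrative.

```python
import io
import struct

raw = b'\x05\x00\x00\x00\x00'    # the empty BSON document {}
bson_file = io.BytesIO(raw)      # stand-in for open('statistics.bson', 'rb')

try:
    bson_file[0:4]               # slicing a file object is not supported
except TypeError as exc:
    print("file object:", exc)

data = bson_file.read()          # read() returns bytes, which can be sliced
(length,) = struct.unpack('<i', data[0:4])
print("bytes:", length)
```

The first attempt fails with a TypeError much like the one in the question, while the bytes returned by read() can be sliced and unpacked.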

Wander Nauta