3

I'm looking to work on a project involving the Venmo dataset. I was able to torrent the BSON file and it's sitting on my desktop, but I don't know what to do with it. I'm not too familiar with MongoDB, and I'm looking to turn the file into a pandas DataFrame for analysis. Does anyone have any tips on doing so?

Hank Yun

1 Answer

5

Below is a Python example of how to read a BSON file:

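import pandas as pd
import bson  # the bson module that ships with PyMongo (pip install pymongo)

FILE = "/folder/file.bson"

# Read the whole dump into memory and decode every BSON document in it
with open(FILE, "rb") as f:
    data = bson.decode_all(f.read())

# One row per document, one column per top-level field
main_df = pd.DataFrame(data)
main_df.describe()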
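For a dump as large as the Venmo one, reading the whole file and decoding it in one go may not fit in memory. A possible variation, offered as a sketch rather than a tested recipe for this dataset, is to decode documents lazily with PyMongo's `bson.decode_file_iter` and build one DataFrame per batch. The helper names `batched` and `read_bson_in_batches` and the default `batch_size` are made up for illustration.

```python
from itertools import islice


def batched(documents, batch_size):
    """Yield lists of up to batch_size items from any iterable."""
    iterator = iter(documents)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def read_bson_in_batches(path, batch_size=100_000):
    """Yield one DataFrame per batch of documents in a BSON dump file.

    Hypothetical helper: assumes PyMongo's bson module and pandas are installed.
    """
    # Imported here so the batching helper above has no third-party dependency.
    import bson  # PyMongo's bson package
    import pandas as pd

    with open(path, "rb") as f:
        # decode_file_iter decodes one document at a time instead of all at once
        for batch in batched(bson.decode_file_iter(f), batch_size):
            yield pd.DataFrame(batch)
```

Each yielded frame can be filtered or aggregated on its own, so only one batch of decoded documents needs to be held in memory at a time.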
Alexey Vazhnov
  • This works with the `import bson` that comes from `pip install pymongo`. Note that it does not work with the `import bson` from `pip install bson`. If you happen to have both installed, PyMongo's `bson` takes precedence over the standalone one, and in that case you can also just run `pip uninstall bson`. If you ever need both packages, use `pip install pybson` and then `from pybson import bson as ...` instead; the alternative name is described in https://github.com/py-bson/bson/issues/70 – questionto42 Jun 20 '20 at 11:27
  • The current answer uses PyMongo. Does anyone know how to do the same thing with the standalone bson package (= pybson)? I only got a one-row DataFrame with the following code, borrowed from https://stackoverflow.com/questions/27527982/read-bson-file-in-python: `b = open(mongodbbsonfilename, 'rb').read()` `bs = bson.loads(b)` `data = bson.decode_binary_subtype( bs, 2 )` `df = pd.DataFrame.from_dict(pd.json_normalize(data), orient='columns')` When I change read() to readfiles(), the result is no longer a BSON string but a list. – questionto42 Jun 20 '20 at 11:55
  • Follow-up to the previous comment. The error is `TypeError: a bytes-like object is required, not 'list'`. I tried converting that list (see the previous comment) to a BSON string using an approach similar to the one in this JSON problem, without success: https://pythonpedia.com/en/knowledge-base/48614158/read-json-file-as-pandas-dataframe- – questionto42 Jun 20 '20 at 12:47
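On the one-row result in the comments above: a likely explanation is that a dump file is a plain concatenation of BSON documents and `loads` decodes only one document. Per the BSON specification, every document starts with a little-endian int32 holding its total length, so the file can be split with the standard library alone. The following is a minimal sketch, not a verified fix for the commenter's setup; the function name `iter_raw_bson_documents` is hypothetical.

```python
import struct


def iter_raw_bson_documents(path):
    """Yield the raw bytes of each BSON document in a concatenated dump file."""
    with open(path, "rb") as f:
        while True:
            header = f.read(4)
            if not header:
                return  # clean end of file
            if len(header) < 4:
                raise ValueError("truncated length prefix")
            # The prefix is a little-endian int32 counting the whole document,
            # including the 4 prefix bytes themselves.
            (length,) = struct.unpack("<i", header)
            if length < 5:  # smallest valid document: 4-byte length + 0x00
                raise ValueError(f"invalid document length: {length}")
            body = f.read(length - 4)
            if len(body) != length - 4:
                raise ValueError("truncated document")
            yield header + body
```

Assuming the standalone package's `bson.loads` accepts the bytes of a single document, as its use in the comment above suggests, something like `pd.DataFrame([bson.loads(raw) for raw in iter_raw_bson_documents(FILE)])` should then give one row per document.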