I want to convert a dataframe to a bson file.

I am extracting data from a website using a library called "fundamentos". One of its methods returns a dataframe, and I want to convert this dataframe to a bson file.

I have tried converting this dataframe to a json file, which I then converted to a bson file. But the Id in this bson file is not an ObjectId, and I need it to be an ObjectId. Does anyone know a different method to do this?

1 Answer

IIUC, fundamentos seems to return a pandas.DataFrame, so you can use to_dict along with json_util from bson to make your bson file:

# https://stackoverflow.com/a/12983651/16120011
# IMPORTANT NOTE: make sure to use the bson module installed by pymongo

import pandas as pd
from bson import ObjectId
from bson.json_util import dumps

df = pd.DataFrame({"userid": [4, 1, 3, 2], "username": ["foo", "bar", "baz", "qux"]})

# Add one freshly generated ObjectId per row as the _id column
# https://www.mongodb.com/docs/manual/core/document/#the-_id-field
df.insert(0, "_id", [ObjectId() for _ in range(len(df))])

# json_util.dumps writes each ObjectId as {"$oid": "..."} (MongoDB Extended JSON)
with open("output.bson", "wb") as file:
    file.write(dumps(df.to_dict(orient="records")).encode("utf-8"))

Output :

print(df)
                        _id  userid username
0  6462bdcdf855f712f8505b6d       4      foo
1  6462bdcdf855f712f8505b6e       1      bar
2  6462bdcdf855f712f8505b6f       3      baz
3  6462bdcdf855f712f8505b70       2      qux

#output.bson
[{"_id": {"$oid": "6462bdcdf855f712f8505b6d"}, "userid": 4, "username": "foo"}, {"_id": {"$oid": "6462bdcdf855f712f8505b6e"}, "userid": 1, "username": "bar"}, {"_id": {"$oid": "6462bdcdf855f712f8505b6f"}, "userid": 3, "username": "baz"}, {"_id": {"$oid": "6462bdcdf855f712f8505b70"}, "userid": 2, "username": "qux"}]

To read the bson file back as a DataFrame, you can use:

import pandas as pd
from bson.json_util import loads

# loads turns each {"$oid": "..."} back into an ObjectId
with open("output.bson", "r") as b:
    dfback = pd.DataFrame(loads(b.read()))

Output :

print(dfback)
                        _id  userid username
0  6462bdcdf855f712f8505b6d       4      foo
1  6462bdcdf855f712f8505b6e       1      bar
2  6462bdcdf855f712f8505b6f       3      baz
3  6462bdcdf855f712f8505b70       2      qux
Timeless