I am facing a performance problem when reading data from MongoDB with PyMongo and building a pandas DataFrame from the query results.
My collection holds 5 million documents of the following shape, about 4.7 GB of data in total:
{
"CUST_LEVEL1" : "ALL CHANNELS",
"CUST_LEVEL2" : "CUSTOMER CHANNEL",
"CUST_LEVEL3" : "Commercial Rebate",
"CUST_LEVEL4" : "Express Scripts Inc.",
"CUST_LEVEL5" : "UBC ESI MC NON STND",
"CUST_LEVEL6" : "BR - 103 OPEN 103-140Price Protection",
"PRODUCT_LEVEL1" : "Allergan USA Inc.",
"PRODUCT_LEVEL2" : "SPECIALIZE",
"PRODUCT_LEVEL3" : "MED DERM",
"PRODUCT_LEVEL4" : "TAZORAC",
"PRODUCT_LEVEL5" : 46,
"PRODUCT_LEVEL6" : "Tazorac .1% Gel 30 G",
"METHODOLOGY" : "",
"CALCULATION_PERIODS" : "",
"FREQ_CAL_START_PERIOD_SID" : "",
"FREQ_CAL_END_PERIOD_SID" : "",
"PERIOD_DATE" : "2015-01-01 00:00:00",
"ACCOUNT_GROWTH" : 2,
"PRODUCT_GROWTH" : 3,
"CCP_DETAILS_SID" : 563275,
"PROJECTION_SALES" : 0,
"PROJECTION_UNITS" : 0,
"ACTUAL_SALES" : 0,
"ACTUAL_UNITS" : 0,
"EXFACTORY_ACTUAL_SALES" : 0,
"EXFACTORY_ACTUAL_UNITS" : 0,
"EXFACTORY_CUST_ACTUAL_SALES" : 0,
"EXFACTORY_CUST_ACTUAL_UNITS" : 0,
"EXFACTORY_FORECAST_SALES" : 0,
"EXFACTORY_FORECAST_UNITS" : 0,
"EXFACTORY_CUST_FORECAST_UNITS" : 0,
"EXFACTORY_CUST_FORECAST_SALES" : 0,
"ITEM_PRICE" : 0
}
I followed this answer to read the data from MongoDB with PyMongo; the code below is what I use:
db = _connect_mongo(host=host, port=port, username=username, password=password, db=db)
# Make a query to the specific DB and Collection
cursor = db['EVENTSTABLE'].find({}, {"PERIOD_DATE": 1, "CCP_DETAILS_SID": 1, "ACCOUNT_GROWTH": 1,
                                     "PRODUCT_GROWTH": 1, "ACTUAL_SALES": 1, "EXFACTORY_FORECAST_SALES": 1,
                                     "EXFACTORY_FORECAST_UNITS": 1, "ITEM_PRICE": 1, "PROJECTION_SALES": 1})
# Expand the cursor and construct the DataFrame
fd = pd.DataFrame(list(cursor))
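One detail worth noting: the projection above does not exclude _id, so MongoDB still returns it for every document and pd.DataFrame ends up with an extra column of ObjectIds. Dropping it is a small, safe change (my addition, not part of the linked answer):
projection = {"_id": 0, "PERIOD_DATE": 1, "CCP_DETAILS_SID": 1, "ACCOUNT_GROWTH": 1,
              "PRODUCT_GROWTH": 1, "ACTUAL_SALES": 1, "EXFACTORY_FORECAST_SALES": 1,
              "EXFACTORY_FORECAST_UNITS": 1, "ITEM_PRICE": 1, "PROJECTION_SALES": 1}
cursor = db['EVENTSTABLE'].find({}, projection)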
After reading the data, I derive the "PROJECTION_SALES" field with the code below:
fd['PERIOD_DATE'] = pd.to_datetime(fd['PERIOD_DATE'])  # PERIOD_DATE is stored as a string in MongoDB
year = fd['PERIOD_DATE'].dt.year
growth = (fd['ACCOUNT_GROWTH'] / 100 + 1) * (fd['PRODUCT_GROWTH'] / 100 + 1)
# average forecast price per (CCP_DETAILS_SID, year); fall back to day-weighted ITEM_PRICE
avg_price = (fd.groupby(['CCP_DETAILS_SID', year])['EXFACTORY_FORECAST_SALES'].transform('sum')
             .divide(fd.groupby(['CCP_DETAILS_SID', year])['EXFACTORY_FORECAST_UNITS'].transform('sum'))
             .fillna(fd['ITEM_PRICE'] * fd['PERIOD_DATE'].dt.daysinmonth.divide(
                 fd.assign(x=fd['PERIOD_DATE'].dt.daysinmonth).groupby(['CCP_DETAILS_SID', year])['x'].transform('sum'))))
base_sales = fd[year == BASE].groupby([year, 'CCP_DETAILS_SID'])['ACTUAL_SALES'].transform('sum')
fd['PROJECTION_SALES'] = (growth * avg_price.pct_change(periods=3).fillna(1)).cumprod() * base_sales
Finally, I write the result back to the MongoDB collection.
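The write-back is roughly along these lines (a simplified sketch; the filter keys are illustrative, and PERIOD_DATE has to be formatted back to the stored string form because it was parsed to datetime above):
from pymongo import UpdateOne

ops = [UpdateOne({"CCP_DETAILS_SID": row.CCP_DETAILS_SID,
                  "PERIOD_DATE": row.PERIOD_DATE.strftime("%Y-%m-%d %H:%M:%S")},
                 {"$set": {"PROJECTION_SALES": row.PROJECTION_SALES}})
       for row in fd.itertuples(index=False)]
# bulk_write sends the updates in batches instead of one round trip per row
db['EVENTSTABLE'].bulk_write(ops, ordered=False)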
MongoDB and Python run on two different Windows machines, each with 8 GB of RAM. When I run the code above, just pulling the data from MongoDB into Python takes a very long time; I can see this in the mongod log.
I am reading the entire collection. Would reading the data in chunks improve performance?
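By chunks I mean something like the sketch below, which builds the DataFrame batch by batch instead of materializing the whole result with list(cursor) (the chunk and batch sizes are arbitrary, and the full DataFrame still ends up in memory; only the giant intermediate list of dicts is avoided):
import itertools

def read_in_chunks(cursor, chunksize=100000):
    frames = []
    while True:
        chunk = list(itertools.islice(cursor, chunksize))  # next chunksize documents
        if not chunk:
            break
        frames.append(pd.DataFrame(chunk))
    return pd.concat(frames, ignore_index=True)

cursor = db['EVENTSTABLE'].find({}, projection).batch_size(10000)
fd = read_in_chunks(cursor)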
Is the PyMongo client the bottleneck?
Are there any better ways to improve performance?
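One idea I am considering is pushing the per-(CCP_DETAILS_SID, year) sums into MongoDB itself, so that only one document per group crosses the network. An untested sketch (PERIOD_DATE is stored as a string, so the year is sliced out with $substrBytes):
pipeline = [
    {"$group": {
        "_id": {"ccp": "$CCP_DETAILS_SID",
                "year": {"$substrBytes": ["$PERIOD_DATE", 0, 4]}},
        "forecast_sales": {"$sum": "$EXFACTORY_FORECAST_SALES"},
        "forecast_units": {"$sum": "$EXFACTORY_FORECAST_UNITS"},
    }},
]
sums = pd.DataFrame(db['EVENTSTABLE'].aggregate(pipeline, allowDiskUse=True))
The remaining per-row arithmetic would still run in pandas after merging these group sums back, but far less raw data would have to be transferred. Would that be the right direction?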