I am facing a performance problem when reading data from MongoDB with PyMongo and building a pandas DataFrame from the query results.
My collection holds 5 million documents of the following shape, about 4.7 GB of data in total:
{
"CUST_LEVEL1" : "ALL CHANNELS",
"CUST_LEVEL2" : "CUSTOMER CHANNEL",
"CUST_LEVEL3" : "Commercial Rebate",
"CUST_LEVEL4" : "Express Scripts Inc.",
"CUST_LEVEL5" : "UBC ESI MC NON STND",
"CUST_LEVEL6" : "BR - 103 OPEN 103-140Price Protection",
"PRODUCT_LEVEL1" : "Allergan USA Inc.",
"PRODUCT_LEVEL2" : "SPECIALIZE",
"PRODUCT_LEVEL3" : "MED DERM",
"PRODUCT_LEVEL4" : "TAZORAC",
"PRODUCT_LEVEL5" : 46,
"PRODUCT_LEVEL6" : "Tazorac .1% Gel 30 G",
"METHODOLOGY" : "",
"CALCULATION_PERIODS" : "",
"FREQ_CAL_START_PERIOD_SID" : "",
"FREQ_CAL_END_PERIOD_SID" : "",
"PERIOD_DATE" : "2015-01-01 00:00:00",
"ACCOUNT_GROWTH" : 2,
"PRODUCT_GROWTH" : 3,
"CCP_DETAILS_SID" : 563275,
"PROJECTION_SALES" : 0,
"PROJECTION_UNITS" : 0,
"ACTUAL_SALES" : 0,
"ACTUAL_UNITS" : 0,
"EXFACTORY_ACTUAL_SALES" : 0,
"EXFACTORY_ACTUAL_UNITS" : 0,
"EXFACTORY_CUST_ACTUAL_SALES" : 0,
"EXFACTORY_CUST_ACTUAL_UNITS" : 0,
"EXFACTORY_FORECAST_SALES" : 0,
"EXFACTORY_FORECAST_UNITS" : 0,
"EXFACTORY_CUST_FORECAST_UNITS" : 0,
"EXFACTORY_CUST_FORECAST_SALES" : 0,
"ITEM_PRICE" : 0
}
I followed this answer to read the data from MongoDB with PyMongo; the code below is what I use:
db = _connect_mongo(host=host, port=port, username=username, password=password, db=db)
# Make a query to the specific DB and Collection
cursor = db['EVENTSTABLE'].find({}, {"PERIOD_DATE": 1, "CCP_DETAILS_SID": 1, "ACCOUNT_GROWTH": 1,
                                     "PRODUCT_GROWTH": 1, "ACTUAL_SALES": 1, "EXFACTORY_FORECAST_SALES": 1,
                                     "EXFACTORY_FORECAST_UNITS": 1, "ITEM_PRICE": 1, "PROJECTION_SALES": 1})
# Expand the cursor and construct the DataFrame
fd = pd.DataFrame(list(cursor))
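One detail worth noting: the projection above does not exclude _id, so MongoDB still returns it for every document and pd.DataFrame ends up with an extra column of ObjectIds. Dropping it is a small, safe change (my addition, not part of the linked answer):
projection = {"_id": 0, "PERIOD_DATE": 1, "CCP_DETAILS_SID": 1, "ACCOUNT_GROWTH": 1,
              "PRODUCT_GROWTH": 1, "ACTUAL_SALES": 1, "EXFACTORY_FORECAST_SALES": 1,
              "EXFACTORY_FORECAST_UNITS": 1, "ITEM_PRICE": 1, "PROJECTION_SALES": 1}
cursor = db['EVENTSTABLE'].find({}, projection)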
After reading the data, I derive the "PROJECTION_SALES" field with the code below:
fd['PERIOD_DATE'] = pd.to_datetime(fd['PERIOD_DATE'])  # PERIOD_DATE is stored as a string in MongoDB
year = fd['PERIOD_DATE'].dt.year
growth = (fd['ACCOUNT_GROWTH'] / 100 + 1) * (fd['PRODUCT_GROWTH'] / 100 + 1)
# average forecast price per (CCP_DETAILS_SID, year); fall back to day-weighted ITEM_PRICE
avg_price = (fd.groupby(['CCP_DETAILS_SID', year])['EXFACTORY_FORECAST_SALES'].transform('sum')
             .divide(fd.groupby(['CCP_DETAILS_SID', year])['EXFACTORY_FORECAST_UNITS'].transform('sum'))
             .fillna(fd['ITEM_PRICE'] * fd['PERIOD_DATE'].dt.daysinmonth.divide(
                 fd.assign(x=fd['PERIOD_DATE'].dt.daysinmonth).groupby(['CCP_DETAILS_SID', year])['x'].transform('sum'))))
base_sales = fd[year == BASE].groupby([year, 'CCP_DETAILS_SID'])['ACTUAL_SALES'].transform('sum')
fd['PROJECTION_SALES'] = (growth * avg_price.pct_change(periods=3).fillna(1)).cumprod() * base_sales
Finally, I write the result back to the MongoDB collection.
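The write-back is roughly along these lines (a simplified sketch; the filter keys are illustrative, and PERIOD_DATE has to be formatted back to the stored string form because it was parsed to datetime above):
from pymongo import UpdateOne

ops = [UpdateOne({"CCP_DETAILS_SID": row.CCP_DETAILS_SID,
                  "PERIOD_DATE": row.PERIOD_DATE.strftime("%Y-%m-%d %H:%M:%S")},
                 {"$set": {"PROJECTION_SALES": row.PROJECTION_SALES}})
       for row in fd.itertuples(index=False)]
# bulk_write sends the updates in batches instead of one round trip per row
db['EVENTSTABLE'].bulk_write(ops, ordered=False)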
MongoDB and Python run on two different Windows machines, each with 8 GB of RAM. When I run the code above, just pulling the data from MongoDB into Python takes a very long time; I can see this in the mongod log.
I am reading the entire collection. Would reading the data in chunks improve performance?
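By chunks I mean something like the sketch below, which builds the DataFrame batch by batch instead of materializing the whole result with list(cursor) (the chunk and batch sizes are arbitrary, and the full DataFrame still ends up in memory; only the giant intermediate list of dicts is avoided):
import itertools

def read_in_chunks(cursor, chunksize=100000):
    frames = []
    while True:
        chunk = list(itertools.islice(cursor, chunksize))  # next chunksize documents
        if not chunk:
            break
        frames.append(pd.DataFrame(chunk))
    return pd.concat(frames, ignore_index=True)

cursor = db['EVENTSTABLE'].find({}, projection).batch_size(10000)
fd = read_in_chunks(cursor)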
Is the PyMongo client the bottleneck?
Are there any better ways to improve performance?
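One idea I am considering is pushing the per-(CCP_DETAILS_SID, year) sums into MongoDB itself, so that only one document per group crosses the network. An untested sketch (PERIOD_DATE is stored as a string, so the year is sliced out with $substrBytes):
pipeline = [
    {"$group": {
        "_id": {"ccp": "$CCP_DETAILS_SID",
                "year": {"$substrBytes": ["$PERIOD_DATE", 0, 4]}},
        "forecast_sales": {"$sum": "$EXFACTORY_FORECAST_SALES"},
        "forecast_units": {"$sum": "$EXFACTORY_FORECAST_UNITS"},
    }},
]
sums = pd.DataFrame(db['EVENTSTABLE'].aggregate(pipeline, allowDiskUse=True))
The remaining per-row arithmetic would still run in pandas after merging these group sums back, but far less raw data would have to be transferred. Would that be the right direction?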