4

I am facing performance problem while reading data from mongodb using pymongo and creating data frame from the results of mongo query.

I have below type of documents in a collection and i have totally 5 million documents in a collection with the size of 4.7 Gb of data

{      
        "CUST_LEVEL1" : "ALL CHANNELS",
        "CUST_LEVEL2" : "CUSTOMER CHANNEL",
        "CUST_LEVEL3" : "Commercial Rebate",
        "CUST_LEVEL4" : "Express Scripts Inc.",
        "CUST_LEVEL5" : "UBC ESI MC NON STND",
        "CUST_LEVEL6" : "BR - 103 OPEN 103-140Price Protection",
        "PRODUCT_LEVEL1" : "Allergan USA Inc.",
        "PRODUCT_LEVEL2" : "SPECIALIZE",
        "PRODUCT_LEVEL3" : "MED DERM",
        "PRODUCT_LEVEL4" : "TAZORAC",
        "PRODUCT_LEVEL5" : 46,
        "PRODUCT_LEVEL6" : "Tazorac .1% Gel 30 G",
        "METHODOLOGY" : "",
        "CALCULATION_PERIODS" : "",
        "FREQ_CAL_START_PERIOD_SID" : "",
        "FREQ_CAL_END_PERIOD_SID" : "",
        "PERIOD_DATE" : "2015-01-01 00:00:00",
        "ACCOUNT_GROWTH" : 2,
        "PRODUCT_GROWTH" : 3,
        "CCP_DETAILS_SID" : 563275,
        "PROJECTION_SALES" : 0,
        "PROJECTION_UNITS" : 0,
        "ACTUAL_SALES" : 0,
        "ACTUAL_UNITS" : 0,
        "EXFACTORY_ACTUAL_SALES" : 0,
        "EXFACTORY_ACTUAL_UNITS" : 0,
        "EXFACTORY_CUST_ACTUAL_SALES" : 0,
        "EXFACTORY_CUST_ACTUAL_UNITS" : 0,
        "EXFACTORY_FORECAST_SALES" : 0,
        "EXFACTORY_FORECAST_UNITS" : 0,
        "EXFACTORY_CUST_FORECAST_UNITS" : 0,
        "EXFACTORY_CUST_FORECAST_SALES" : 0,
        "ITEM_PRICE" : 0
}

I have followed this answer to read data from mongodb using Pymongo and below code is used to read data from mongodb

db = _connect_mongo(host=host, port=port, username=username, password=password, db=db)
# Make a query to the specific DB and Collection
cursor = db['EVENTSTABLE'].find({},{"PERIOD_DATE":1,"CCP_DETAILS_SID":1,"ACCOUNT_GROWTH":1,"PRODUCT_GROWTH":1,"ACTUAL_SALES":1,
           "EXFACTORY_FORECAST_SALES":1,"EXFACTORY_FORECAST_UNITS":1,"ITEM_PRICE":1,"PROJECTION_SALES":1})

# Expand the cursor and construct the DataFrame
fd =  pd.DataFrame(list(cursor))

After reading data from mongodb i wrote below code to derive "PROJECTION_SALES" field.

fd['PROJECTION_SALES']=((((fd['ACCOUNT_GROWTH']/100)+1)*((fd['PRODUCT_GROWTH']/100)+1)*
   ((fd.groupby(['CCP_DETAILS_SID',fd['PERIOD_DATE'].dt.year])['EXFACTORY_FORECAST_SALES'].transform('sum').divide(
        fd.groupby(['CCP_DETAILS_SID',fd['PERIOD_DATE'].dt.year])['EXFACTORY_FORECAST_UNITS'].transform('sum')).
fillna(fd['ITEM_PRICE']*fd['PERIOD_DATE'].dt.daysinmonth.divide((fd.assign(x=fd.PERIOD_DATE.dt.daysinmonth).groupby(['CCP_DETAILS_SID',fd['PERIOD_DATE'].dt.year])['x'].transform('sum'))))))
.pct_change(periods=3).fillna(1)).cumprod()*(fd[fd['PERIOD_DATE'].dt.year==BASE].groupby([fd['PERIOD_DATE'].dt.year,'CCP_DETAILS_SID'])['ACTUAL_SALES'].transform('sum')))

And then finally updating back to mongodb collection.

MongoDb and Python running in two different machines with 8GB RAM in windows OS. When i run above Python code reading data from mongodb is taking too long time to bring data to Python this i can find it in Mongod.

I am reading entire data from collection. Will reading the data in chunks improve the performance?

Is Pymongo client the problem?

Are there any better ways to improve performance?

James Z
  • 12,209
  • 10
  • 24
  • 44
Tharunkumar Reddy
  • 2,773
  • 18
  • 32

0 Answers0