I am trying to make my scripts more efficient.
At the moment I have 10 scripts; each one reads data in, processes it, and writes the output.
They all read from the same main DB tables and just do different things with the data.
So I have consolidated them into one script, the idea being that the data is read once rather than 10 times.
Shouldn't that result in faster execution? Because it doesn't.
Below is an example of the structure I am using.
Any help would be amazing.
Thanks
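The read-once idea itself is sound. As a pure-Python analogy (no Spark involved; the table name and row values below are made up), caching means the underlying read runs once no matter how many downstream consumers ask for the same data:

```python
from functools import lru_cache

calls = 0

@lru_cache(maxsize=None)
def read_table(name):
    # Stand-in for an expensive DB read; counts how often the "read" really runs.
    global calls
    calls += 1
    return [(1, 100), (1, 200), (2, 50)]

# Ten "scripts" consuming the same table: the read happens once, not ten times.
for _ in range(10):
    rows = read_table("db.table")

print(calls)  # 1
```

Whether Spark actually delivers this saving depends on when (and whether) each cached DataFrame is materialized, which is where the script below can go wrong.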
'''
TABLE DEFINITION AND CACHING
'''
import pyspark.sql.functions as sqlfunc  # needed for sqlfunc.sum in the aggregations below

spark_session = create_session('usage CF')
usage_logs = spark_session.sql("select * from db.table where dt = " + yday_date).cache()
user_logs = spark_session.sql("select * from db2.table2 where dt = " + yday_date).cache()
usercat = spark_session.sql("select * from db3.table3 where dt = " + yday_date).cache()
radius_logs = spark_session.sql("select * from db.table4 where dt = " + yday_date)
radius = radius_logs.select('emsisdn2', 'sessionid2', 'custavp1').cache()
'''
usage CF
'''
usage = usage_logs.select('field1', 'field2', 'field3')
conditions = [usage.sid == radius.sessionid2]
df3 = usage.join(radius, conditions, how='left')
df4 = df3.groupBy('field1', 'field2').agg(sqlfunc.sum('field3').alias('bytesdl'))
df4.createOrReplaceTempView('usage')  # returns None, so don't assign the result
usage_table_output = spark_session.sql(' insert overwrite table outputdb.outputtbl partition(dt = ' + yday_date + ') select "usage" as type, * from usage ')
'''
user CF
'''
user = usage_logs.filter(usage_logs.vslsessid == '0').select('field1', 'field2', 'field3', 'field4')
conditionsx = [user.sessionid == radius.sessionid2]
user_joined = user.join(radius, conditionsx, how='left')
user_output = user_joined.groupBy('field1', 'field2', 'field3').agg(sqlfunc.sum('field4').alias('bytesdl'))
user_output.createOrReplaceTempView('user')  # returns None, so don't assign the result
user_table_output = spark_session.sql(' insert overwrite table outputdb.outputtbl2 partition(dt = ' + yday_date + ') select "user" as type, * from user')
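A separate thing worth checking: the queries concatenate `yday_date` straight into the SQL text. If `dt` is a string-typed partition column, the unquoted value produces a query that fails or matches nothing, so every cached read could be empty or erroring before the caching question even matters. A small sketch (the date value here is hypothetical; the real script computes `yday_date` elsewhere):

```python
yday_date = "2024-01-01"  # hypothetical value standing in for the real variable

# Unquoted concatenation, as in the script: only valid if dt is numeric.
unquoted = "select * from db.table where dt = " + yday_date

# If dt is a string partition column, the literal must be quoted:
quoted = f"select * from db.table where dt = '{yday_date}'"

print(unquoted)
print(quoted)
```

It may be worth printing the final SQL strings once and running them by hand to confirm they return the expected partitions.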