I'm using PySpark to extract files, apply basic transformations, and load the data into Hive. I use a for loop to find the extract files and load each one into Hive. We have around 60 tables, and looping over each file sequentially takes too long, so I'm using ThreadPoolExecutor to run the loads in parallel. Here is a sample code prototype.
from concurrent import futures

def func(args):
    df = extract(args)
    tbl, status = load(df)
    return tbl, status

def extract(args):
    ### finding the file list and extracting it into a DataFrame ###
    return df

def load(df):
    ### loading it to Hive ###
    status[tbl] = 'Completed'
    return tbl, status

status = {}
listA = ['ABC', 'BCD', 'DEF']
prcs = []
with futures.ThreadPoolExecutor() as executor:
    for i in listA:
        prcs.append(executor.submit(func, i))
    for tsk in futures.as_completed(prcs):
        tbl, status = tsk.result()
        print(tbl)
        print(status)
It works well. I'm redirecting the spark-submit log to a file, but with ThreadPoolExecutor the log lines from all the threads are interleaved and I can't debug anything. Is there a better way to group the logs by thread? Here each thread corresponds to one table. I'm new to Python. Kindly help.
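For what it's worth, here is a minimal sketch of the kind of output I'm after (my own guess, not something I've verified against spark-submit redirection): tag every log line with the table name via Python's standard logging module, by renaming each worker thread after its table, so the combined log file can at least be grepped per table. I understand this would only cover the Python driver side, not Spark's own JVM/log4j output. The file name load.log and the stubbed-out work inside func are placeholders.

import logging
import threading
from concurrent import futures

logging.basicConfig(
    filename='load.log',   # hypothetical log file; spark-submit redirect would differ
    level=logging.INFO,
    format='%(asctime)s [%(threadName)s] %(levelname)s %(message)s',
)
log = logging.getLogger(__name__)

def func(tbl):
    # Rename the current worker thread after the table so every log
    # line from this task carries the table name via %(threadName)s.
    threading.current_thread().name = tbl
    log.info('extract started')
    # df = extract(tbl); tbl, status = load(df)   # real work goes here
    log.info('load completed')
    return tbl, 'Completed'

listA = ['ABC', 'BCD', 'DEF']
with futures.ThreadPoolExecutor(thread_name_prefix='tbl') as executor:
    prcs = [executor.submit(func, t) for t in listA]
    for tsk in futures.as_completed(prcs):
        print(tsk.result())

With this, `grep '\[ABC\]' load.log` would pull out one table's lines, but I don't know if this is the idiomatic approach.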