I want to know how to speed up programs that do a lot of computation on large datasets. I have 5 Python programs, each of which performs some convoluted computations on a large dataset. For example, a portion of one of the programs is as follows:
import pandas as pd

df = get_data_from_redshift()

# One aggregation per column: nunique for most columns, a value-count dict for Col3.
cols = [{'col': 'Col1', 'func': pd.Series.nunique},
        {'col': 'Col2', 'func': pd.Series.nunique},
        {'col': 'Col3', 'func': lambda x: x.value_counts().to_dict()},
        {'col': 'Col4', 'func': pd.Series.nunique},
        {'col': 'Col5', 'func': pd.Series.nunique}]

d = df.groupby('Column_Name').apply(lambda x: tuple(c['func'](x[c['col']]) for c in cols)).to_dict()
where get_data_from_redshift() connects to a Redshift cluster, gets data from the database, and writes it to a dataframe (about 600,000 rows x 6 columns).
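Roughly, it boils down to something like the sketch below (a simplified stand-in assuming SQLAlchemy with psycopg2 and pandas.read_sql; the real host, credentials and query are different):

import pandas as pd
from sqlalchemy import create_engine

def get_data_from_redshift():
    # Redshift speaks the Postgres wire protocol, so a standard
    # postgresql+psycopg2 connection string works; the host, credentials
    # and query here are placeholders.
    engine = create_engine(
        'postgresql+psycopg2://user:password@my-cluster.redshift.amazonaws.com:5439/mydb')
    return pd.read_sql('SELECT * FROM my_table', engine)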
The other programs also use this dataframe df and perform a lot of computation, and each program writes its result to a pickle file.
The final program loads the pickle files created by the 5 programs, does some computations to get some 300,000 values, and then checks them against another database in the cluster to produce the final output file.
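The loading part of the final program is essentially just reading the 5 pickle files back, something like this (the file names are placeholders):

import pickle

# Placeholder names for the pickle files written by the 5 programs.
result_files = ['program1.pkl', 'program2.pkl', 'program3.pkl',
                'program4.pkl', 'program5.pkl']

results = []
for path in result_files:
    with open(path, 'rb') as f:
        results.append(pickle.load(f))
# ...the final computations and the checks against the other database then run on results.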
Running each program individually takes hours (sometimes overnight). However, I need the whole thing to run within an hour and give me the final output file.
I tried putting one of the programs on an EC2 instance to see if the performance improves, but it has been over 3 hours and it's still running. I tried m4.xlarge, c4.xlarge, and r4.xlarge instances, but none of them helped.
Is there a way to speed up the total run time?
Maybe I could run each of the 5 programs on a separate EC2 instance, but then each program produces an output file, which the final program has to use. So, if I run on multiple instances, the output files from each program will be saved on different servers, right? Then how will the final program use them? Can I save the output file from each program to a common location that the final program can access?
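For example, I'm picturing something like this, with an S3 bucket as the common location (the bucket name and keys below are made up, and I don't know whether this is the right approach):

import boto3

s3 = boto3.client('s3')

# Each of the 5 programs would upload its pickle file when it finishes
# ('my-results-bucket' and the keys are placeholders).
s3.upload_file('program1.pkl', 'my-results-bucket', 'results/program1.pkl')

# The final program (on its own instance) would then download all 5 files.
for i in range(1, 6):
    s3.download_file('my-results-bucket', f'results/program{i}.pkl', f'program{i}.pkl')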
I've heard of GPUs being 14 times faster than CPUs, but I've never used them. Will using a GPU instance be of any help in this case?
Sorry, I'm new here and don't really know how to go about this.