
I want to know how to speed up programs that do a lot of computation on large datasets. I have 5 Python programs, each of which performs some convoluted computations on a large dataset. For example, a portion of one of the programs is as follows:

df = get_data_from_redshift()
cols = [{'col': 'Col1', 'func': pd.Series.nunique},
        {'col': 'Col2', 'func': pd.Series.nunique},
        {'col': 'Col3', 'func': lambda x: x.value_counts().to_dict()},
        {'col': 'Col4', 'func': pd.Series.nunique},
        {'col': 'Col5', 'func': pd.Series.nunique}]

d = df.groupby('Column_Name').apply(lambda x: tuple(c['func'](x[c['col']]) for c in cols)).to_dict()

where get_data_from_redshift() connects to a Redshift cluster, gets data from the database, and writes it to a dataframe (about 600,000 rows x 6 columns).

The other programs also use this dataframe df and perform a lot of computation, and each program writes its result to a pickle file.

The final program loads the pickle files created by the 5 programs, does some computations to get around 300,000 values, and then checks them against another database in the cluster to produce a final output file.

Running each program individually takes hours (sometimes overnight). However, I need the whole thing to run within an hour and give me the final output file.

I tried putting one of the programs on an EC2 instance to see if performance improves, but after over 3 hours it was still running. I tried m4.xlarge, c4.xlarge and r4.xlarge instances, but none of them helped.

Is there a way to speed up the total run time?

Maybe I could run each of the 5 programs on a separate EC2 instance, but each program gives an output file that the final program has to use. So if I run on multiple instances, the output files from each program will be saved on different servers, right? Then how will the final program use them? Can we save the output file from each program to a common location that the final program can access?

I've heard of GPUs being 14 times faster than CPUs, but I've never used them. Will using a GPU instance be of any help in this case?

Sorry, I'm new to this and don't really know how to go about it.

  • Houston, we have a problem. "*I've heard of GPUs being 14 times faster than CPUs... Will using a GPU instance be of any help in this case?*" **I've heard that blowing smoke into the water produces gold. Will that help?** With all due respect, this is serious. A heavily convoluted process on a large, remote-hosted dataset will never qualify for a reasonable speedup on SIMD / SMX GPU hardware to **cut ~12 hours of [SEQ]-processing into "*I need that the whole thing runs within an hour*"**. Be realistic, be quantitatively supported ... serially convoluted processing has problems (ref. Amdahl's Law) – user3666197 Sep 12 '17 at 11:10
  • If your principal process does not exhibit above a 99% [PAR] portion, leaving less than 1% for the [SEQ] parts and avoiding as much as possible of the distributed-processing setup / tear-down overheads, you will never be able to gain the ~12x speedup you have asked for, even with an infinite number of parallel processors. Simply the **Laws of diminishing returns** explain why this will never, **indeed NEVER**, happen. You may use an interactive visual modelling tool for prototyping / proving these insights, available for scaling [SEQ]+[PAR] schedules at: https://stackoverflow.com/a/46124635 – user3666197 Sep 12 '17 at 11:16

1 Answer


You need to find out what's making it slow. If you can't think of anything else at the moment, start with a profiler. Finding the exact problem is the simplest way to make things faster.
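For example, here is a minimal sketch using the standard-library cProfile module; run_computation is just a hypothetical stand-in for the expensive part of one of your scripts:

import cProfile
import pstats

def run_computation():
    # hypothetical stand-in for the expensive part of one of your programs
    ...

profiler = cProfile.Profile()
profiler.enable()
run_computation()
profiler.disable()

# show the 20 calls with the largest cumulative time
pstats.Stats(profiler).sort_stats('cumulative').print_stats(20)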

What follows is a generic approach.

First, optimizations to the architecture or algorithm can substantially outperform any other kind of optimization (such as those provided by languages, tools, or even techniques like memoization). So first look thoroughly at whether your algorithm can be improved. That includes looking for parts that can be executed in parallel. For example, using map-reduce instead of linear data processing can cut execution times to a fraction, but it requires that the processing can be divided into mutually exclusive chunks that can be worked on independently.
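As a rough illustration (a sketch, not your actual code) of dividing work this way, the per-group computation from your snippet could be spread across CPU cores with the standard multiprocessing module, assuming each group can be processed independently; get_data_from_redshift() and the cols list are taken from the question:

from multiprocessing import Pool

import pandas as pd

cols = [{'col': 'Col1', 'func': pd.Series.nunique},
        {'col': 'Col2', 'func': pd.Series.nunique},
        {'col': 'Col3', 'func': lambda x: x.value_counts().to_dict()},
        {'col': 'Col4', 'func': pd.Series.nunique},
        {'col': 'Col5', 'func': pd.Series.nunique}]

def process_group(item):
    # item is one (group_key, sub_dataframe) pair produced by df.groupby(...)
    key, group = item
    return key, tuple(c['func'](group[c['col']]) for c in cols)

if __name__ == '__main__':
    df = get_data_from_redshift()              # the loader from the question
    groups = list(df.groupby('Column_Name'))   # independent chunks of work
    with Pool() as pool:                       # one worker per CPU core by default
        d = dict(pool.map(process_group, groups))

Whether this helps depends on how expensive each group is relative to the cost of shipping the sub-dataframes to the worker processes.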

Next, look for unnecessary loops or repeated computations. Techniques like memoization can also improve performance greatly.
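A minimal memoization sketch with the standard functools.lru_cache decorator; expensive_lookup is a hypothetical placeholder for any pure function your programs call repeatedly with the same arguments:

from functools import lru_cache

@lru_cache(maxsize=None)
def expensive_lookup(key):
    # hypothetical pure function: the same input always yields the same output,
    # so repeated calls can return the cached result instead of recomputing it
    return sum(ord(ch) for ch in key) ** 2

expensive_lookup('Col1')   # computed once
expensive_lookup('Col1')   # served from the cache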

If there are communication or I/O tasks that take a lot of time (e.g. the communication with the Redshift cluster you mentioned), try to minimise them or do them only once and reuse the result (though this doesn't seem to be your main issue, since your concern is that the computation is slow).
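If the Redshift transfer itself turns out to be slow, one option (a sketch, assuming the data does not change between runs; redshift_cache.pkl is a hypothetical file name) is to pull the data once and cache it locally so the other programs reuse it:

import os
import pandas as pd

CACHE_PATH = 'redshift_cache.pkl'   # hypothetical local cache file

def load_data():
    # reuse the locally cached dataframe if it exists,
    # otherwise pull from Redshift once and cache the result
    if os.path.exists(CACHE_PATH):
        return pd.read_pickle(CACHE_PATH)
    df = get_data_from_redshift()    # the loader from the question
    df.to_pickle(CACHE_PATH)
    return df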

Finally, there are minor optimizations: using built-in functions like map and filter, or generator expressions instead of lists, can speed things up (very) slightly.
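For instance, in this toy example the generator expression gives the same result as the list comprehension without materialising a 10-million element list first:

# builds a 10-million element list in memory first, then sums it
total = sum([x * x for x in range(10_000_000)])

# a generator expression yields the same result without the intermediate list
total = sum(x * x for x in range(10_000_000))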