
I am working on an assignment in a class that I now realize may be a little out of my reach (this is the first semester I have done any programming).

The stipulation is that I use parallel programming with MPI.

I have to input a CSV file of up to a terabyte of tick data (one record every microsecond) that may be locally out of order, run a process on the data to identify noise, and output a cleaned data file.

I have written a serial program using Pandas that takes the data, determines significant outliers, and writes them to a dataset labeled noise, then creates the final dataset as original minus noise, matched on the index (time).
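In rough pandas terms, the serial version does something like this (simplified; my real outlier test is more involved than a plain 3-sigma rule, and the "price" column name is just for illustration):

import pandas as pd

df = pd.read_csv("ticks.csv", index_col=0)  # time-indexed tick data

# flag significant outliers (placeholder rule: > 3 std devs from the mean)
mask = (df["price"] - df["price"].mean()).abs() > 3 * df["price"].std()
noise = df[mask]

# final dataset = original minus noise, matched on the time index
clean = df.drop(noise.index)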

I have no idea where to start on parallelizing the program. I understand that, because my computations are all local, I should import from the CSV in parallel and run the process to identify noise.

I believe the best way to do this (and I may be completely wrong) is to scatter the data, run the computation, and gather the results using HDF5, but I do not know how to implement this.
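To make that concrete, the pattern I am picturing is roughly the sketch below, pieced together from the mpi4py docs, so I may well be misusing it. The file name, the "price" column, and the 3-sigma rule are placeholders for my actual logic:

from mpi4py import MPI
import pandas as pd
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

if rank == 0:
    # root reads and splits the data (a real 1 TB file would not fit in
    # memory like this, which is part of what I am unsure about)
    df = pd.read_csv("ticks.csv", index_col=0)
    chunks = np.array_split(df, size)
else:
    chunks = None

local = comm.scatter(chunks, root=0)      # each rank gets one piece

# run my serial noise test on the local piece (placeholder rule)
m, s = local["price"].mean(), local["price"].std()
clean = local[(local["price"] - m).abs() <= 3 * s]

pieces = comm.gather(clean, root=0)       # collect the cleaned pieces
if rank == 0:
    pd.concat(pieces).to_hdf("cleaned.h5", key="ticks")  # needs PyTables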

I do not want someone to write the entire code for me, but maybe a specific example of importing from CSV in parallel and regathering the data, or a better approach to the problem.

  • maybe this helps: http://stackoverflow.com/questions/8424771/parallel-processing-of-a-large-csv-file-in-python – dahrens Feb 17 '17 at 20:54
  • When you say you've written a program that "takes the data," what specifically do you mean: one row of the CSV file? Multiple rows? All of the rows? This is crucial to designing an approach to parallelize your solution. – gregory Feb 17 '17 at 21:31
  • I use read_csv in chunks to put it in a pandas dataframe. – Shilo Wilson Feb 17 '17 at 22:08

1 Answer

If you can boil down your program to a function to run against a list of rows, then yes, a simple multiprocessing approach would be easy and effective. For instance:

from multiprocessing import Pool

def clean_tickData(rows):
    # <your code>
    ...

if __name__ == "__main__":
    pool = Pool()
    pool.map(clean_tickData, csv_rows)  # csv_rows: your iterable of rows/chunks
    pool.close()
    pool.join()

map from Pool runs the function in parallel. You can control how many parallel processes are used, but the default, set with the empty Pool() call, starts as many processes as you have CPU cores. So, if you can reduce your clean-up work to a function that can be run over the various rows of your CSV, pool.map would be an easy and fast implementation.
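To connect this to the chunked read_csv you mentioned in the comments, a rough end-to-end sketch might look like the following. The file name, the "price" column, and the 3-sigma rule are assumptions on my part; substitute your actual cleaning logic:

import pandas as pd
from multiprocessing import Pool

def clean_tickData(chunk):
    # placeholder rule: keep rows within 3 std devs of the chunk mean
    m, s = chunk["price"].mean(), chunk["price"].std()
    return chunk[(chunk["price"] - m).abs() <= 3 * s]

if __name__ == "__main__":
    # read_csv with chunksize yields DataFrames lazily, so the whole
    # terabyte never has to fit in memory at once
    reader = pd.read_csv("ticks.csv", index_col=0, chunksize=100_000)
    with Pool() as pool:
        first = True
        # imap preserves chunk order while the workers run in parallel
        for piece in pool.imap(clean_tickData, reader):
            piece.to_csv("cleaned.csv", mode="a", header=first)
            first = False

One caveat: computing the outlier threshold per chunk is not the same as computing it over the whole file, so you will want to pick a chunk size (or a two-pass scheme) that keeps your noise test valid.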

gregory
  • Thank you, this is the exact kind of outline I was looking for. And yes, my code is fairly simple to define as a function. I will try this tonight! – Shilo Wilson Feb 17 '17 at 22:09