I am working on an assignment for a class that I now realize may be a little out of my reach (this is the first semester I have done any programming).
The stipulation is that I use parallel programming with MPI.
I have to read in a CSV file of up to a terabyte of tick data (one tick every microsecond) that may be locally out of order, run a process on the data to identify noise, and output a cleaned data file.
I have written a serial program using Pandas that takes the data, determines significant outliers, and writes them to a dataset labeled noise. It then creates the final dataset by removing the noise rows from the original, matching on the index (time).
I have no idea where to start parallelizing the program. I understand that because my computations are all local, I should import from the CSV in parallel and run the noise-identification process on each piece.
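To make the parallel import concrete, here is a minimal sketch of one possible way to split the file, not a definitive implementation. It assumes each worker knows its `rank` and the total number of workers `size` (with mpi4py, which is an assumption and not something the assignment names, these would come from `MPI.COMM_WORLD.Get_rank()` and `MPI.COMM_WORLD.Get_size()`). The function names `chunk_bounds` and `read_chunk` are hypothetical. Each worker takes a byte range of the file and shifts both ends forward to the next line start, so no row is split or read twice.

```python
import os


def chunk_bounds(path, rank, size):
    """Return (start, end) byte offsets of the lines owned by worker `rank`.

    A line belongs to the worker whose nominal byte range contains the
    line's first byte, so the ranges are contiguous and do not overlap.
    """
    file_size = os.path.getsize(path)

    def align(offset):
        # Offsets at the very start or at/after the end need no adjustment.
        if offset <= 0:
            return 0
        if offset >= file_size:
            return file_size
        with open(path, "rb") as f:
            # Step back one byte and read to the end of that line; the file
            # position is then the start of the first line at or after `offset`.
            f.seek(offset - 1)
            f.readline()
            return f.tell()

    start = align(file_size * rank // size)
    end = align(file_size * (rank + 1) // size)
    return start, end


def read_chunk(path, rank, size):
    """Read this worker's lines into memory (assumes the chunk fits in RAM)."""
    start, end = chunk_bounds(path, rank, size)
    with open(path, "rb") as f:
        f.seek(start)
        return f.read(end - start).decode().splitlines()
```

Only worker 0 sees the header row in this sketch, so the other workers would need the column names passed in separately. It also does not deal with rows that are out of order across a chunk boundary, so some overlap between neighbouring chunks may still be needed.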
I believe the best way to do this (and I may be completely wrong) is to scatter the data, run the computation, and gather the results using HDF5, but I do not know how to implement this.
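As a point of comparison for the gather step, here is a rough sketch of an alternative that avoids collecting everything on one process. It swaps the HDF5 gather for plain part files, and it rests on several assumptions: each worker has already written its cleaned rows to its own CSV part file, sorted by time, with no header row, and the first column is a timestamp that parses as a number. The name `merge_sorted_parts` is hypothetical. The part files are then streamed through `heapq.merge`, which holds only one line per file in memory at a time.

```python
import heapq


def first_column_as_float(line):
    # Assumes the timestamp is the first comma-separated field and is numeric.
    return float(line.split(",", 1)[0])


def merge_sorted_parts(part_paths, out_path, key=first_column_as_float):
    """Merge per-worker part files (each sorted by time) into one sorted file."""
    files = [open(p, "r", newline="") for p in part_paths]
    try:
        with open(out_path, "w", newline="") as out:
            # heapq.merge lazily yields lines in key order across all inputs.
            out.writelines(heapq.merge(*files, key=key))
    finally:
        for f in files:
            f.close()
```

Whether this is better than gathering into HDF5 probably depends on how much of the terabyte survives the cleaning step and on how many part files there are.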
I do not want someone to write the entire code, but maybe a specific example of importing in parallel from a CSV and regathering the data, or a better approach to the problem.