I'm porting my algorithms from MATLAB to Python, and I'm stuck on parallel processing.
I need to process a very large number of CSV files (1 to 1M), each with a large number of rows (10k to 10M) and 5 independent data columns.
I already have code that does this, but it uses only one processor, and loading the CSVs into a dictionary in RAM takes about 30 minutes (~1k CSVs of ~100k rows each).
The file names are in a list loaded from a CSV (this part is already done):
Amp Freq Offset PW FileName
3 10000.0 1.5 1e-08 FlexOut_20140814_221948.csv
3 10000.0 1.5 1.1e-08 FlexOut_20140814_222000.csv
3 10000.0 1.5 1.2e-08 FlexOut_20140814_222012.csv
...
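For reference, here is a minimal sketch of how I load that index with pandas, assuming the index file is whitespace-separated with the header shown above ("index.csv" is a placeholder name):

import pandas as pd

# Placeholder file name; columns follow the header shown above.
index = pd.read_csv("index.csv", sep=r"\s+")
index = index.sort_values("PW")            # keep the PW ordering explicit
file_list = index["FileName"].tolist()     # ordered list of CSV names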
And each CSV has the form (example: FlexOut_20140815_013804.csv):
# TDC characterization output file , compress
# TDC time : Fri Aug 15 01:38:04 2014
#- Event index number
#- Channel from 0 to 15
#- Pulse width [ps] (1 ns precision)
#- Time stamp rising edge [ps] (500 ps precision)
#- Time stamp falling edge [ps] (500 ps precision)
##Event Channel Pwidth TSrise TSfall
0 6 1003500 42955273671237500 42955273672241000
1 6 1003500 42955273771239000 42955273772242500
2 6 1003500 42955273871241000 42955273872244500
...
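Since the files only differ in their '#' metadata lines, loading a single file is straightforward. A minimal sketch, assuming pandas is acceptable (load_tdc_csv is a hypothetical helper name; the column names are taken from the '##Event ...' header line, which itself gets skipped as a comment):

import pandas as pd

def load_tdc_csv(path):
    # Lines starting with '#' (the metadata and the '##Event ...' header)
    # are dropped; the five data columns are whitespace-separated.
    df = pd.read_csv(
        path,
        sep=r"\s+",
        comment="#",
        names=["Event", "Channel", "Pwidth", "TSrise", "TSfall"],
    )
    return {col: df[col].to_numpy() for col in df.columns}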
I'm looking for something like MATLAB's 'parfor' that takes each name from the list, opens the file, and puts the data into a list of dictionaries. It has to be a list because the files have an order (PW), but in the examples I've found it seems more complicated to preserve that order, so first I will try to put the data in a dictionary and arrange it into a list afterwards.
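For what it's worth, multiprocessing.Pool.map already returns results in the same order as its input iterable, so the PW ordering of the file list should carry over to the output list without extra bookkeeping. A minimal sketch, assuming the load_tdc_csv helper from the sketch above:

from multiprocessing import Pool

def load_all(file_list, workers=4):
    # Pool.map preserves input order, so the result is an ordered
    # list of per-file dictionaries, one per entry in file_list.
    with Pool(processes=workers) as pool:
        return pool.map(load_tdc_csv, file_list)

if __name__ == "__main__":
    data = load_all(file_list)   # file_list from the index sketch above

Note that on platforms that spawn worker processes (e.g. Windows), load_tdc_csv must be defined at module top level and the call has to sit under the if __name__ == "__main__": guard, as above.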
For now I'm starting from the multiprocessing examples on the web, e.g. Writing to dictionary of objects in parallel. I will post updates when I have a piece of "working" code.