
I'm porting my algorithms from MATLAB to Python and I'm stuck on the parallel processing part.

I need to process a very large number of CSVs (1 to 1M files), each with a large number of rows (10k to 10M) and 5 independent data columns.

I already have code that does this, but only on a single processor; loading the CSVs into a dictionary in RAM takes about 30 min (~1k CSVs of ~100k rows each).

The file names are in a list loaded from a CSV (this part is already done):

Amp Freq    Offset  PW  FileName
3   10000.0 1.5 1e-08   FlexOut_20140814_221948.csv
3   10000.0 1.5 1.1e-08 FlexOut_20140814_222000.csv
3   10000.0 1.5 1.2e-08 FlexOut_20140814_222012.csv
...

And each CSV has the form (example: FlexOut_20140815_013804.csv):

# TDC characterization output file , compress
# TDC time : Fri Aug 15 01:38:04 2014
#- Event index number
#- Channel from 0 to 15
#- Pulse width [ps] (1 ns precision)
#- Time stamp rising edge [ps] (500 ps precision)
#- Time stamp falling edge [ps] (500 ps precision)
##Event Channel Pwidth  TSrise  TSfall
0   6   1003500 42955273671237500   42955273672241000
1   6   1003500 42955273771239000   42955273772242500
2   6   1003500 42955273871241000   42955273872244500
...
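
For reference, reading a single file of this form into a dict of column arrays can be done with numpy (a sketch, assuming whitespace-separated integer columns and '#' comment lines exactly as above; read_tdc_csv is just an illustrative name):

    import numpy as np

    def read_tdc_csv(path):
        """Load one TDC output file into a dict of column arrays.
        Lines starting with '#' (the header comments) are skipped."""
        data = np.loadtxt(path, comments='#', dtype=np.int64)
        return {
            'Event':   data[:, 0],
            'Channel': data[:, 1],
            'Pwidth':  data[:, 2],
            'TSrise':  data[:, 3],
            'TSfall':  data[:, 4],
        }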

I'm looking for something like MATLAB's 'parfor' that takes each name from the list, opens the file and puts the data into a list of dictionaries. It's a list because the files have an order (PW), but in the examples I've found preserving that order looks more complicated, so my first attempt will be to put everything into a dictionary and arrange the data into a list afterwards.
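
A minimal parfor-like sketch with multiprocessing.Pool, assuming a per-file reader like the read_tdc_csv above and a file_names list holding the FileName column in PW order; Pool.map returns its results in the same order as its input, so the ordering is preserved for free:

    from multiprocessing import Pool

    def load_all(file_names, workers=4):
        """Read every file in parallel; the returned list of dicts
        keeps the same (PW) order as file_names."""
        pool = Pool(processes=workers)
        try:
            return pool.map(read_tdc_csv, file_names)
        finally:
            pool.close()
            pool.join()

    if __name__ == '__main__':
        # file_names: the FileName column already loaded from the index CSV
        file_names = ['FlexOut_20140814_221948.csv', 'FlexOut_20140814_222000.csv']
        data = load_all(file_names)

Each worker sends its whole dict back to the parent through a pipe, but most of the time should still be spent in the actual parsing.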

For now I'm starting from the multiprocessing examples on the web, e.g. Writing to dictionary of objects in parallel. I will post updates when I have a piece of "working" code.
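
For the "dictionary first, ordered list later" route, a sketch with a multiprocessing.Manager dict keyed by file name (assuming the same file_names list and read_tdc_csv as above; probably slower than plain Pool.map because every result passes through the manager process, but closest to the linked example):

    from functools import partial
    from multiprocessing import Manager, Pool

    def worker(shared, fname):
        # write this file's dict of columns into the shared dictionary
        shared[fname] = read_tdc_csv(fname)

    if __name__ == '__main__':
        manager = Manager()
        shared = manager.dict()
        pool = Pool(processes=4)
        pool.map(partial(worker, shared), file_names)
        pool.close()
        pool.join()
        # rebuild the PW-ordered list afterwards
        ordered = [shared[f] for f in file_names]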

  • Just thinking out loud... if you have 1k files of 100k rows, that is 100M rows. If the rows average 50 characters each, that means you have 5GB of data roughly before you start on any data structures to index it and manage it. I am hoping your machine has plenty of RAM. Can you load the dictionaries once and leave them loaded in a separate task that runs as a server and your other operations just connect to the server (using sockets or somesuch) to access the data without having to repeatedly reload? – Mark Setchell Aug 30 '14 at 09:28
  • Hi Mark, yes, RAM is not a problem: I have 24GB of RAM and 100GB of swap, and with MATLAB I'm used to processing very large amounts of data (50GB or more). The problem is that I'm a Python rookie but a MATLAB pro, so my MATLAB code is complex and optimized to get the most out of my computer. – taquionbcn Aug 30 '14 at 09:43
    If you're doing data manipulation and are coming from matlab, you should look into [`pandas`](http://pandas.pydata.org); you'll wind up reimplementing a lot of its functionality yourself otherwise. – DSM Aug 30 '14 at 13:24
  • possible duplicate: http://stackoverflow.com/questions/19941963/parallel-read-table-in-pandas – johntellsall Aug 30 '14 at 22:32
  • done!! Now I'm busy, but when I have a time slot I will post the code for comments (a first pandas sketch is below). – taquionbcn Sep 08 '14 at 10:29
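
Following DSM's pandas suggestion, reading one of these files with pandas.read_csv could look roughly like this (a sketch; the column names are supplied manually because the real header line starts with '#' and gets dropped with the other comments):

    import pandas as pd

    def read_tdc_frame(path):
        # '#' comment lines are dropped; columns are whitespace-separated
        return pd.read_csv(path, sep=r'\s+', comment='#',
                           names=['Event', 'Channel', 'Pwidth', 'TSrise', 'TSfall'])

The same Pool.map pattern as above can then apply read_tdc_frame to the file list, and pandas.concat can stack the frames if a single table turns out to be more convenient than a list of dictionaries.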

0 Answers