
I have a dataset with 3 million lines to process. The processing functions are cythonized. When I run the entire pipeline on a small subsample of 10,000 lines, processing takes about 1.5 minutes, and a subsample of 30,000 lines takes about 3 minutes. However, when I process the whole dataset, only a quarter of it is done after 10 hours, although I expected a processing time of at most 5 hours. I'm running Ubuntu 14.04 64-bit and Anaconda 64-bit. RAM usage is at 50%. I deactivated redirecting to the login screen after a period of inactivity, but performance stayed the same. Switching off the screen after inactivity didn't influence execution time either. What else could be the reason for this unexpectedly slow execution?
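One way to narrow this down is to measure whether the per-item cost actually stays constant as the sample grows. The sketch below times increasing subsample sizes and compares the ratio; `process_row` is a hypothetical stand-in for the real cythonized function, not the asker's actual code.

```python
import time

def process_row(row):
    # hypothetical stand-in for the real cythonized processing function
    return sum(i * i for i in range(row % 100))

def time_subsample(rows, n):
    """Time processing the first n rows and return seconds elapsed."""
    start = time.perf_counter()
    for row in rows[:n]:
        process_row(row)
    return time.perf_counter() - start

rows = list(range(30000))
t10k = time_subsample(rows, 10000)
t30k = time_subsample(rows, 30000)
# If processing is linear, t30k / t10k should be close to 3; a much
# larger ratio points to super-linear behaviour (e.g. a growing data
# structure, memory pressure, or per-item cost rising over time).
print(t30k / t10k)
```

If the ratio keeps climbing as n grows, the slowdown is in the algorithm or data structures, not the machine.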

user3276418
  • I would suggest adding a logger to document the progress. You may want to find out if there are single items that take much more time or the processing for all items slows down. It's hard to diagnose the problem from your question. – cel Jan 29 '15 at 08:45
  • What do you mean by adding a logger? – user3276418 Jan 29 '15 at 14:47
  • Let your program write down the time when it has started and finished processing each item. Then check for anomalies... – cel Jan 29 '15 at 15:20
  • Run it on the big dataset and [*just do this*](http://stackoverflow.com/a/4299378/23771). You will see exactly where the problem is. – Mike Dunlavey Jan 29 '15 at 15:51
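The logging approach cel suggests can be sketched as follows: record throughput every few thousand items so a gradual slowdown or a few pathological items become visible. `process_row` and `report_every` are illustrative names, not part of the asker's code.

```python
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
log = logging.getLogger(__name__)

def process_row(row):
    # hypothetical stand-in for the real cythonized function
    return row * row

def process_all(rows, report_every=10000):
    """Process rows, logging throughput periodically to spot slowdowns."""
    start = time.perf_counter()
    last = start
    for i, row in enumerate(rows, 1):
        process_row(row)
        if i % report_every == 0:
            now = time.perf_counter()
            log.info("processed %d rows; last %d took %.2fs (total %.1fs)",
                     i, report_every, now - last, now - start)
            last = now

process_all(range(50000))
```

If the per-batch time grows steadily, look for state that accumulates across iterations (lists, caches, pandas objects being appended to); if a single batch spikes, inspect the items in it.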

0 Answers