
I am very new to multithreading, in Python or any other language, and I am trying to use it to improve the speed of my program.

Basically, I have many large datasets, and my memory can only fit two of them at the same time. The solution I have in mind is to read the first dataset, then load the second dataset in a background thread while processing the first. That way I can save the time spent waiting for the second dataset to load. Does this work?
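The overlap idea described above can be sketched like this; `load_dataset(path)` and `process(dataset)` are hypothetical placeholders for the actual loading and processing code, passed in as callables:

```python
# Minimal sketch of pipelined loading: while the current dataset is
# being processed, a background thread loads the next one, so at most
# two datasets are in memory at any time.
from concurrent.futures import ThreadPoolExecutor

def run_pipeline(paths, load_dataset, process):
    """load_dataset(path) -> dataset (I/O-bound), process(dataset) -> None."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(load_dataset, paths[0])   # start first load
        for next_path in paths[1:]:
            current = future.result()                  # wait for load to finish
            future = pool.submit(load_dataset, next_path)  # kick off next load
            process(current)                           # overlaps with the load
        process(future.result())                       # last dataset
```

Note that this only helps if `load_dataset` is genuinely I/O-bound (reading from disk or network releases the GIL), so the load and the CPU work can actually run concurrently.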

DiveIntoML
  • You may want to look at [this post](http://stackoverflow.com/questions/990102/python-global-interpreter-lock-gil-workaround-on-multi-core-systems-using-task), and also look into Python's global interpreter lock (GIL) and how it limits threading in applications. From the above link, this snippet seems relevant: _'But that this can seriously backfire on multi-core systems and you end up with IO intensive threads being heavily blocked by CPU intensive threads, the expense of context switching, the ctrl-C problem[*] and so on.'_ – PrestonH Jan 17 '17 at 19:28
  • Are the files stored locally? Or are you pulling from a server? – Navidad20 Jan 17 '17 at 19:30
  • Run benchmarks. If a second concurrent query significantly slows down the first query, you can save time with your approach. – Dávid Horváth Jan 17 '17 at 19:30
  • 1
    You should use non-blocking I/O and an event loop instead of multithreading. Or maybe two processes that alternate loading and processing data. (If using Python 3.4 or higher is an option, you can use asyncio.) – Sven Marnach Jan 17 '17 at 19:30
  • There are some good PyCon videos - search for `async`, `await`, and `concurrency`. It should be fairly easy to set up a simple test to see if your process will benefit. – wwii Jan 17 '17 at 19:38
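As a sketch of the two-process alternative mentioned in the comments (again with hypothetical `load_dataset`/`process` callables): because the GIL prevents CPU-bound Python code in one thread from running alongside another thread's Python code, running the processing in a separate process avoids that contention. Note the function passed as `process` must be picklable (e.g. defined at module level):

```python
# Rough sketch: the main process does the I/O-bound loading while a
# worker process does the CPU-bound processing, sidestepping the GIL.
from concurrent.futures import ProcessPoolExecutor

def run_with_processes(paths, load_dataset, process):
    """Load in the main process; run process(data) in a worker process."""
    results = []
    with ProcessPoolExecutor(max_workers=1) as pool:
        future = None
        for path in paths:
            data = load_dataset(path)            # I/O in the main process
            if future is not None:
                results.append(future.result())  # collect previous result
            future = pool.submit(process, data)  # CPU work in the worker
        results.append(future.result())          # last dataset
    return results
```

Benchmarking both variants against a plain sequential loop, as suggested above, is the only reliable way to see which wins for your workload.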

0 Answers