
I'm building a scientific application in Python and am considering running it on Amazon EC2.

My application is both memory and CPU hungry and would benefit from any resources given to it.

An Extra Large Instance of EC2 gives about 15GB of memory, along with 8 compute units.

My question is, can a single Python script (when run on EC2) take advantage of all 8 compute units? Or must I run 8 independent processes in order to fully take advantage of the 8 compute units?

Note: in case it matters, I plan on using a Linux instance on EC2.

– user3262424

2 Answers


Python has a GIL that makes it complex to write multi-threaded applications that fully utilize more than one core. You can read more about it in the question "How do threads work in Python, and what are common Python-threading specific pitfalls?", or in http://www.dabeaz.com/python/UnderstandingGIL.pdf if you're really into the details. I tend to use Python threads only to run various tasks (such as communication) in the background, rather than for performance.
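As a rough illustration (in the spirit of the benchmarks in the linked PDF), here is a minimal sketch; the workload and iteration counts are made up, but on CPython the threaded version typically runs no faster than the single-threaded one because of the GIL:

```python
import threading
import time

def count_down(n):
    # Pure-Python, CPU-bound loop; the GIL allows only one thread
    # to execute Python bytecode at a time.
    while n > 0:
        n -= 1

N = 10000000

# All the work in a single thread.
start = time.time()
count_down(2 * N)
print("one thread:  %.2fs" % (time.time() - start))

# The same amount of work split across two threads; on CPython this
# is usually no faster (and often slower) because of the GIL.
start = time.time()
threads = [threading.Thread(target=count_down, args=(N,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("two threads: %.2fs" % (time.time() - start))
```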

As Jeremy said, using the multiprocessing module is an alternative option, or you could simply write your script so it works on independent parts of your data, and then start however many copies you prefer.
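A minimal sketch of that last approach might look like the following; the script name, the `input.dat` file, and the `process` function are hypothetical placeholders, and each copy picks its slice of the data from its command-line arguments:

```python
# process_chunk.py (hypothetical) -- each copy of this script handles one
# slice of the data, chosen by command-line arguments.
import sys

def process(lines):
    # Placeholder for the real CPU-heavy computation on one chunk.
    return sum(len(line) for line in lines)

if __name__ == "__main__":
    chunk_index = int(sys.argv[1])   # which slice this copy handles (0-based)
    num_chunks = int(sys.argv[2])    # how many copies were started in total

    with open("input.dat") as f:     # hypothetical input file
        lines = f.readlines()

    # Take every num_chunks-th line, starting at chunk_index.
    my_lines = lines[chunk_index::num_chunks]
    print(chunk_index, process(my_lines))
```

You would then start eight copies yourself, e.g. `python process_chunk.py 0 8` through `python process_chunk.py 7 8`, each as a separate process.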

– Jan
  • thank you. It looks like it is better to run 8 different processes, each working on a different part of the data. This is fine. Yet, all of them will need to share/update the same memory chunk. In `python`, how do I provide all 8 processes access to the same chunk of memory? – user3262424 Jun 05 '11 at 18:28
  • "complex to write multi-threaded applications that fully utilize more than one core" Just to clarify, it's easy to write multi-processor scripts and usually impractical to write multi-threaded scripts. – mgoldwasser Jul 08 '14 at 19:03

The 8 "compute units" run across 4 physical processors, so a straightforward script would only be able to use 25% of the available power. However, the Python multiprocessing module allows you to write a single script using multiple processes, potentially taking advantage of all of the "compute units".

– Jeremy
  • @Jeremy Banks: thank you. My application requires that all 'forks' -- while performing their own task -- will communicate / manipulate the same chunk of memory. Is this possible when using `multiprocessing`? – user3262424 Jun 05 '11 at 18:14
  • There's an API for communicating between processes, or you can use shared memory. Read http://docs.python.org/library/multiprocessing.html – Jan Jun 05 '11 at 18:21
  • @Jan / @Jeremy: for sharing memory purposes, is it better to use `multiprocessing`, or alternatively, define a linux RAM drive and have all independent processes communicate with it? – user3262424 Jun 05 '11 at 18:42
  • I've never used RAM disks for this purpose so I cannot give you a definitive answer, sorry. I think newer Linux kernels offer /dev/shm, which is an easy-to-use 'ramdisk'. Generally speaking, I would think the main difference is ease of use: using the Python API saves you from having to do your synchronization with more manual file I/O. You can also look into ctypes shared memory, or raw shared memory if that fits you better; there are a lot of options out there and without knowing more about your problem, it's hard for me to be specific. Myself, I'd probably start with the easy way. :) – Jan Jun 05 '11 at 19:17
  • @Jan: thank you. Is the `easy way` the `python` way? Or is it better just to define one RAM disk and have multiple, independent instances (where each one uses a different subset of the data)? – user3262424 Jun 05 '11 at 19:50
  • I'd say the easy way and the Python way are identical in this case... i.e. I'd try using the multiprocessing module communication/shared_memory functionality first. – Jan Jun 06 '11 at 10:50
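To make the shared-memory suggestion above concrete, here is a minimal sketch using `multiprocessing.Array`; the array size and the "work" (doubling each value) are made up, and because each worker writes only to its own slice, the example gets away with `lock=False`:

```python
from multiprocessing import Process, Array

def worker(shared, start, stop):
    # Each worker updates only its own slice of the shared array in place.
    for i in range(start, stop):
        shared[i] = shared[i] * 2.0

if __name__ == "__main__":
    n = 1000
    # 'd' = double-precision floats; the buffer lives in shared memory,
    # so child processes see (and modify) the same data without copying.
    shared = Array('d', range(n), lock=False)

    num_procs = 4
    step = n // num_procs
    procs = [Process(target=worker, args=(shared, i * step, (i + 1) * step))
             for i in range(num_procs)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()

    print(shared[:5])  # first few values, now doubled
```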