2

I just started using Java, so sorry if this question's answer is obvious. I can't really figure out how to share variables in Java. I have been playing around with Python and wanted to try to port some code over to Java to learn the language a bit better. A lot of my code is ported, but I'm unsure how exactly multiprocessing and sharing of variables works in Java (my process is not disk bound; it uses a lot of CPU and does a lot of searching of a list).

In Python, I can do this:

from multiprocessing import Pool, Manager
manager = Manager()
shared_list = manager.list()
pool = Pool(processes=4)
for variables_to_send in list_of_data_to_process:
    pool.apply_async(function_or_class, (variables_to_send, shared_list))
pool.close()
pool.join()

I've been having a bit of trouble figuring out how to do multiprocessing and sharing like this in Java. This question helped me understand a bit (via the code) how implementing Runnable can help, and I'm starting to think Java might automatically multiprocess threads (correct me if I'm wrong on this; I read that once threads exceed the capacity of a CPU they are moved to another CPU? The Oracle docs seem to be more focused on threads than multiprocessing). But it doesn't explain how to share lists or other variables between processes (and keep them in close enough sync).

Any suggestions or resources? I am hoping I'm searching for the wrong thing (multiprocessing java) and that this is as easy (or similarly straightforward) as it is in my above code.

Thanks!

Lostsoul

3 Answers

3

There is an important difference between a thread and a process, and you are running into it now: with some exceptions, threads share memory, but processes do not.

Note that real operating systems have ways around just about everything I'm about to say, but these features aren't used in the typical case. So, to fire up a new process, you must clone the current process in some way with a system call (on *nix, this is fork()), and then replace the code, stack, command-line arguments, etc. of the child process with another system call (on *nix, this is the exec() family of system calls). Windows has rough equivalents of both these system calls, so everything I'm saying is cross-platform. Also, the Java Runtime Environment takes care of all these system calls under the covers, and without JNI or some other interop technology you can't really execute them yourself.

There are two important things to note about this model: the child process doesn't share the address space of the parent process, and the entire address space of the child process gets replaced on the exec() call. So, variables in the parent process are unavailable to the child process, and vice versa.

The thread model is quite different. Threads are kind of like lite processes, in that each thread has its own instruction pointer, and (on most systems) threads are scheduled by the operating system scheduler. However, a thread is a part of a process. Each process has at least one thread, and all the threads in the process share memory.
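To make the shared-memory point concrete, here is a minimal sketch in Java. The class and method names (`SharedMemoryDemo`, `appendFromThreads`) are made up for illustration, and wrapping the list with `Collections.synchronizedList` is just one of several ways to make concurrent appends safe. Every thread appends to the same list object, and the parent sees all the results once the threads finish.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class SharedMemoryDemo {
    // Starts threadCount threads that all append to one list, waits for
    // them to finish, and returns the final size of that list.
    static int appendFromThreads(int threadCount, int itemsPerThread) {
        // A single list on the heap; every thread in this process can reach it.
        List<Integer> shared = Collections.synchronizedList(new ArrayList<>());
        List<Thread> threads = new ArrayList<>();
        for (int t = 0; t < threadCount; t++) {
            Thread thread = new Thread(() -> {
                for (int i = 0; i < itemsPerThread; i++) {
                    shared.add(i);
                }
            });
            threads.add(thread);
            thread.start();
        }
        try {
            for (Thread thread : threads) {
                thread.join(); // wait for this thread to finish
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return shared.size();
    }

    public static void main(String[] args) {
        System.out.println(appendFromThreads(4, 1000)); // prints 4000
    }
}
```

No copying or serialization happens here: the threads and the main thread all hold references to the same object, which is exactly what separate processes cannot do without extra machinery.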

Now to your problem:

The Python multiprocessing module spawns processes with very little effort, as your code example shows. In Java, spawning a new process takes a little more work. It involves creating a new Process object using ProcessBuilder.start() or Runtime.exec(). Then, you can pipe strings to the child process, get back its output, wait for it to exit, and a few other communication primitives. I would recommend writing one program to act as the coordinator and fire up each of the child processes, and writing a worker program that roughly corresponds to function_or_class in your example. The coordinator can open multiple copies of the worker program, give each a task, and wait for all the workers to finish.
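A rough sketch of the coordinator side might look like the following. The names `Coordinator`, `javaBinary`, and `runWorker` are hypothetical, and `java -version` is only a stand-in for a real worker program (it ignores its input), chosen so the example runs without a second program on hand. Note that `Process.getOutputStream()` is the stream connected to the child's standard input, and `Process.getInputStream()` is connected to its standard output.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class Coordinator {
    // Path of the java launcher belonging to the JVM running this program.
    static String javaBinary() {
        return Paths.get(System.getProperty("java.home"), "bin", "java").toString();
    }

    // Starts one child process, writes task to its stdin, and returns every
    // line the child printed (stdout and stderr merged into one stream).
    static List<String> runWorker(List<String> command, String task) {
        try {
            Process child = new ProcessBuilder(command).redirectErrorStream(true).start();
            // Closing the writer signals end-of-input to the child.
            try (BufferedWriter toChild = new BufferedWriter(
                    new OutputStreamWriter(child.getOutputStream(), StandardCharsets.UTF_8))) {
                toChild.write(task);
            }
            List<String> lines = new ArrayList<>();
            try (BufferedReader fromChild = new BufferedReader(
                    new InputStreamReader(child.getInputStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = fromChild.readLine()) != null) {
                    lines.add(line);
                }
            }
            int exitCode = child.waitFor();
            if (exitCode != 0) {
                throw new IllegalStateException("worker exited with code " + exitCode);
            }
            return lines;
        } catch (IOException e) {
            throw new IllegalStateException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in worker; a real one would read its task from System.in.
        List<String> output = runWorker(Arrays.asList(javaBinary(), "-version"), "");
        output.forEach(System.out::println);
    }
}
```

This sketch runs one worker at a time and writes the whole task before reading any output, which is fine for small messages. A real coordinator would start several workers before waiting on any of them, and would read and write concurrently if the messages are large.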

Adam Mihalcin
  • When I run the Python code above I am creating multiple processes (in the above example, those are unique processes sharing the same list). Is there a way to achieve the same result in Java as easily? – Lostsoul Feb 24 '12 at 01:17
  • The Java Virtual Machine runs as a single process and manages threads internally. You cannot spawn "processes" within the JVM. However that is not necessary to make use of multiple processors. – Usman Ismail Feb 24 '12 at 01:29
  • @Lostsoul You might be able to pass a list between processes using RMI or some other interprocess communication mechanism, but as I say in my answer Java doesn't provide anything nearly as easy to use as Python's multiprocessing. The problem with RMI is that serialization and deserialization, as handled by the RMI library, can get very expensive. The coordinator approach I recommend forces you to write your task description to the worker's stdin by writing to the return value of `Process.getOutputStream()`, but unless each child is using the whole list this can be more efficient. – Adam Mihalcin Feb 24 '12 at 01:31
  • @UsmanIsmail You can spawn processes within the JVM. No, you cannot spawn true processes within *a single JVM*, but I can call `new ProcessBuilder("python", "script.py").start()` or `new ProcessBuilder("java", "-jar", "program.jar").start()` if I so choose. – Adam Mihalcin Feb 24 '12 at 01:33
  • The reason Python uses lightweight processes instead of threads is a limitation in its implementation (the GIL). In Java it's much more sensible to just use threads to begin with - *especially* if we have to share memory anyhow, as in the given example. Porting code from one language to the other without respecting the differences between the two languages just results in inferior, harder-to-maintain code. – Voo Feb 24 '12 at 01:38
  • Thank you so much Adam. I'll be honest, I still don't fully understand how to do it in Java (it seems a lot more indirect than Python), but your logic and links are a really great starting point for me. I'm dealing with a very large list that is mostly read (but also updated) by these subprocesses, and piping strings does not seem like a really efficient way to share state, but if I have to then I will. – Lostsoul Feb 24 '12 at 01:40
  • @Lostsoul Since you need to share state, RMI (see http://en.wikipedia.org/wiki/Java_remote_method_invocation for an overview) starts to make more sense. You can have a single server process, which may or may not be the same as the coordinator, and the workers could act as client processes (all within a single machine, though). In that way, you could keep the shared state in the server process. – Adam Mihalcin Feb 24 '12 at 01:45
  • Thanks Adam! I'm starting to realize this is pretty complicated, but I want to get this done so I'll check all options out. I found a few tutorials on RMI; I'll go through them and play around a bit. Actually, I have a Hadoop/HBase cluster available to me (well, it's a VM on my desktop). Do you think that could help me with multiprocessing/sharing data? – Lostsoul Feb 24 '12 at 01:51
  • @Lostsoul Oh, silly me - I forgot to mention that you could just use threads instead of multiple processes and RMI. However, synchronization gets a bit tricky, so make sure to study up on Java locking (and concurrency control in general). – Adam Mihalcin Feb 24 '12 at 01:53
  • @Lostsoul Hadoop is an implementation of the MapReduce pattern. If your problem can be solved by this pattern (see http://en.wikipedia.org/wiki/Mapreduce for more), you may be able to leverage Hadoop to make development easier. However, Hadoop and HBase don't really shine on a single machine - they were designed to split your data-intensive computation across multiple data centers. – Adam Mihalcin Feb 24 '12 at 01:55
  • @AdamMihalcin True about spawning processes from Java, but why would you need to? A single JVM with lots of memory and well-managed threads is as good as separate processes. Voo has the right idea: tackling this problem in Java does not require processes. – Usman Ismail Feb 24 '12 at 01:55
  • @UsmanIsmail To know "as good as" for sure, you have to develop multiple solutions and compare performance between them. As long as you are careful about concurrency control, I agree that a single JVM is often better than separate processes. – Adam Mihalcin Feb 24 '12 at 02:00
  • Thanks so much guys for all your help, I will start digging. Not trying to be a troll, but why isn't there some kind of standard way of doing this? I'm sure there are people in production environments or research who do very heavy math; it's hard to imagine them limiting their code to only one core while the other cores are totally free. – Lostsoul Feb 24 '12 at 02:04
  • @Lostsoul The standard way is to use multiple threads, and math-heavy research usually uses something other than Java :) – Adam Mihalcin Feb 24 '12 at 02:05
  • @Lostsoul Because the standard way to use multiple cores in Java (and really most languages) is to use threads, not processes. Processes have several disadvantages: they're more expensive to create, it is *much* more expensive to share data between them, and you get higher memory pressure. The only reason CPython uses processes instead of threads is a limitation in its implementation - it's a workaround for not having real threads, not the other way round (look up "Global interpreter lock" in the Python docs). – Voo Feb 24 '12 at 02:28
  • I'll look at threads also. Right now my program takes around 100% of one core but the other 3 are doing nothing. I'll start with threads, then RMI, then the ProcessBuilder approach above. Thanks everyone, you have given me a great starting point and hopefully I'll learn enough to come back with smarter questions :-) – Lostsoul Feb 24 '12 at 02:40
1

You can use Java threads for this purpose. Create a user-defined class with a setter method through which you can set the shared_list object. Implement the Runnable interface and perform the processing task in the run() method. You can find good examples on the internet. If you are sharing the same instance of shared_list between threads, you need to make sure that access to it is synchronized.
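A minimal sketch of that description could look like this. The class names `ListWorker` and `ListWorkerDemo` are invented for the example, and locking on the list itself with a `synchronized` block is only one way to synchronize access.

```java
import java.util.ArrayList;
import java.util.List;

class ListWorker implements Runnable {
    private List<String> sharedList;
    private final String item;

    ListWorker(String item) {
        this.item = item;
    }

    // Setter through which the caller hands over the shared list.
    void setSharedList(List<String> sharedList) {
        this.sharedList = sharedList;
    }

    @Override
    public void run() {
        // Lock on the list so only one thread modifies it at a time.
        synchronized (sharedList) {
            sharedList.add(item);
        }
    }
}

class ListWorkerDemo {
    // Runs count workers against one shared list and returns its final size.
    static int runWorkers(int count) {
        List<String> shared = new ArrayList<>();
        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            ListWorker worker = new ListWorker("item-" + i);
            worker.setSharedList(shared); // every worker gets the same instance
            Thread thread = new Thread(worker);
            threads.add(thread);
            thread.start();
        }
        try {
            for (Thread thread : threads) {
                thread.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        synchronized (shared) {
            return shared.size();
        }
    }

    public static void main(String[] args) {
        System.out.println(runWorkers(8)); // prints 8
    }
}
```

Starting one thread per task is fine for a handful of tasks; for thousands of tasks a thread pool is usually the better fit.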

JProgrammer
1

This is not the easiest way to work with threads in Java, but it's the closest to the Python code you posted. The Task class implements the Callable interface and has a call method. When we create each of the 10000 Task instances, we pass it a reference to the same list, so when the call method of each of those objects is called, they all use the same list.

We are using a fixed size thread pool of 4 threads here so all the tasks we are submitting get queued and wait for a thread to be available.

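import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// SharedListRunner.java
public class SharedListRunner {
    public void runList() {
        ExecutorService executorService = Executors.newFixedThreadPool(4);
        // List is an interface, so instantiate ArrayList; the synchronized
        // wrapper keeps concurrent access from the tasks safe.
        List<String> sharedList = Collections.synchronizedList(new ArrayList<String>());
        sharedList.add("Hello");
        for (int i = 0; i < 10000; i++)
            executorService.submit(new Task(sharedList));
        // Accept no new tasks; queued tasks still run, then the threads exit.
        executorService.shutdown();
    }
}

// Task.java (needs the List and Callable imports as well)
public class Task implements Callable<String> {

    private final List<String> sharedList;

    public Task(List<String> sharedList) {
        this.sharedList = sharedList;
    }

    @Override
    public String call() throws Exception {
        // Do something with the shared list
        sharedList.size();
        return "World";
    }
}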

At any one time, up to 4 threads are accessing the list. If you want to dig further: 4 Java threads are accessing the list, there are probably fewer OS threads servicing those 4 Java threads, and there are even fewer processor threads, normally 2 or 4 per core of your CPU.
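Since the question mentions CPU-heavy searching of a list, here is a hedged sketch of how a fixed thread pool could split such a search and wait for all results, roughly what `pool.close()` and `pool.join()` do in the Python version. The class `ParallelSearch` and its `countMatches` method are hypothetical names, and the example assumes the list is not modified while the search runs, so the tasks can read it without locking.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ParallelSearch {
    // Counts how often target occurs in data, splitting the list into
    // up to `workers` slices that are searched concurrently.
    static int countMatches(List<String> data, String target, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Callable<Integer>> tasks = new ArrayList<>();
            int sliceSize = (data.size() + workers - 1) / workers; // ceiling division
            for (int start = 0; start < data.size(); start += sliceSize) {
                int from = start;
                int to = Math.min(start + sliceSize, data.size());
                // Each task only reads its own index range of the shared list.
                tasks.add(() -> {
                    int found = 0;
                    for (int i = from; i < to; i++) {
                        if (data.get(i).equals(target)) {
                            found++;
                        }
                    }
                    return found;
                });
            }
            int total = 0;
            // invokeAll blocks until every task has completed.
            for (Future<Integer> result : pool.invokeAll(tasks)) {
                total += result.get();
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown(); // let the pool threads exit
        }
    }

    public static void main(String[] args) {
        List<String> data = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            data.add(i % 10 == 0 ? "needle" : "hay");
        }
        System.out.println(countMatches(data, "needle", 4)); // prints 100
    }
}
```

Returning a value from each task and combining the results afterwards avoids contention on a shared result list; a shared, synchronized list is only needed when tasks must see each other's updates while running.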

Usman Ismail
  • Hi Usman... I'm very new to Java (actually started today). How do I run those classes? I know this is basic, but I tried to put them in a file named task.java and it didn't work, and neither did a file called SharedListRunner. I guess I have to call them from another file, but I'm not sure what I need to do to call them (I know I can import both files, but what then?) – Lostsoul Feb 24 '12 at 02:50
  • The class name has to match the file name, so you will have two files: SharedListRunner.java and Task.java. I have also skipped the import declarations (http://leepoint.net/notes-java/language/10basics/import.html). Please go through a few basic tutorials (http://www.javacoffeebreak.com/java101/java101.html) before attempting multi-threaded programs; there are some intricacies here. – Usman Ismail Feb 24 '12 at 15:15