23

I am a newbie using Java to do some data processing on csv files. For that I use the multithreading capabilities of Java (pools of threads) to batch-import the csv files into Java and do some operations on each of their lines. On my quad-core, multithreading speeds up the process a lot.

I am curious to know how/whether multiprocessing would speed up the operations even more? If so, is there a tutorial available somewhere? (The Java Basic Tutorial mentions a class, but I am not familiar enough with the syntax to understand the class by myself.)

from http://download.oracle.com/javase/tutorial/essential/concurrency/procthread.html:

Most implementations of the Java virtual machine run as a single process. A Java application can create additional processes using a ProcessBuilder object. Multiprocess applications are beyond the scope of this lesson [where are they explained then?].

Tim Bender
seinecle
  • Are you CPU bound or I/O bound? Hard drives are significantly slower than processors. Plus, threads are usually lighter weight to switch between and share data between than processes. If your program is constantly waiting for the disk, it's not going to matter a whole lot either way. – Jonathon Faust Nov 03 '11 at 21:21
  • I have a queue of dozens of csv files to import in my java application. I use a pool of threads (seven threads, precisely) to import them quicker than one after the other - at the moment I can import 7 csv files "at once" - one per thread. Could I speed this up even more with multiprocessing? And how is multiprocessing useful for parallelism on a single computer in general? – seinecle Nov 03 '11 at 21:29
  • Usually I find that you can improve the performance of the single thread by much more than just 4x (the best you can hope for from 4 cores if it's CPU bound). I would make sure you have thoroughly profiled and optimised the code you have first. – Peter Lawrey Nov 03 '11 at 22:38
  • I'd be curious to know these tricks - but I'll open a new discussion for that ;-) – seinecle Nov 03 '11 at 22:50

6 Answers

10

I am curious to know how/whether multiprocessing would speed up the operations even more?

No, in fact it would likely make it worse. If you were to switch from multithreading to multiprocessing, then you would effectively launch the JVM multiple times. Starting up a JVM is no simple effort. In fact, the way the JVM on your desktop machine starts is different from the way an enterprise company starts their JVM, just to reduce wait time for applets to launch for the typical end-user.

Tim Bender
  • thx Tim... indeed I found other threads of discussions pointing to this. For the interested reader unearthing this discussion later: http://stackoverflow.com/questions/2006035/how-to-create-a-process-in-java and http://www.javabeat.net/tips/8-using-the-new-process-builder-class.html – seinecle Nov 03 '11 at 22:20
  • As soon as I start thinking about multiprocessing, my brain switches over to C/C++ mode where startup cost isn't *that* high. but we are speaking about Java, and it takes a month and a day, plus half your available ram, (might be slightly exaggerated) to startup a new JVM, which each additional process will require. Good point, Tim. – ObscureRobot Nov 03 '11 at 22:25
  • well, I discovered that this thing could be a solution to reduce startup time: http://martiansoftware.com/nailgun/ – seinecle Nov 03 '11 at 22:54
  • Glad you got the information on how to create a subprocess. I should have included that as well, but thought it would be irrelevant. Just an FYI, ProcessBuilder is preferred over Runtime.exec, but it is rarely the case that anybody takes advantage of the extra functionality offered by ProcessBuilder. – Tim Bender Nov 03 '11 at 23:59
5

There are several ways to start a new process in Java:

  1. ProcessBuilder.start()
  2. Runtime.exec() works around ProcessBuilder
  3. Apache Commons Exec that works around Runtime.exec()

With ProcessBuilder:

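import java.io.File;
import java.lang.ProcessBuilder.Redirect;
import java.util.Map;

// Build the command; each argument is passed separately (no shell parsing).
ProcessBuilder pb =
    new ProcessBuilder("myCommand", "myArg1", "myArg2");
// Adjust the child's environment, which starts as a copy of the parent's.
Map<String, String> env = pb.environment();
env.put("VAR1", "myValue");
env.remove("OTHERVAR");
env.put("VAR2", env.get("VAR1") + "suffix");
// Run the child in myDir, merge stderr into stdout, and append both to a log file.
pb.directory(new File("myDir"));
File log = new File("log");
pb.redirectErrorStream(true);
pb.redirectOutput(Redirect.appendTo(log));
Process p = pb.start();  // throws IOException if the command cannot be started
// Stdin stays a pipe; stdout goes to the log, so the parent-side stream is empty.
assert pb.redirectInput() == Redirect.PIPE;
assert pb.redirectOutput().file() == log;
assert p.getInputStream().read() == -1;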

With Runtime:

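import java.util.concurrent.TimeUnit;

Runtime r = Runtime.getRuntime();
// The String[] overload skips whitespace tokenizing; exec(String) is deprecated since Java 18.
Process p = r.exec(new String[] {"firefox"});
// Wait up to 10 seconds, then terminate the process if it is still running.
p.waitFor(10, TimeUnit.SECONDS);
p.destroy();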

With Apache Commons Exec:

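import org.apache.commons.exec.CommandLine;
import org.apache.commons.exec.DefaultExecutor;

// "file" is a java.io.File defined elsewhere; this command line is Windows-specific.
String line = "AcroRd32.exe /p /h " + file.getAbsolutePath();
CommandLine cmdLine = CommandLine.parse(line);
DefaultExecutor executor = new DefaultExecutor();
int exitValue = executor.execute(cmdLine);  // blocks until the process exits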

Key differences between Multiprocessing and Multithreading from this:

  • The key difference between multiprocessing and multithreading is that multiprocessing relies on a system having two or more CPUs, whereas multithreading lets a single process generate multiple threads to increase the computing speed of a system.
  • A multiprocessing system executes multiple processes simultaneously, whereas a multithreading system executes multiple threads of one process simultaneously.
  • Creating a process can consume time and even exhaust system resources. Creating threads is more economical, because threads belonging to the same process share the resources of that process.
  • Multiprocessing can be classified into symmetric multiprocessing and asymmetric multiprocessing, whereas multithreading is not classified further.

Eugene Lopatkin
  • Thanks. "Runtime.exec() works around ProcessBuilder" and "Apache Commons Exec that works around Runtime.exec()". Do you mean ProcessBuilder is preferred over Runtime.exec(), which is over Apache Commons Exec? – Tim Apr 13 '19 at 16:45
  • I think it depends on what you want to do. If you don't want to use the Apache lib, you could try Runtime. But if you want to do more complicated things, ProcessBuilder could be your choice. – Eugene Lopatkin Apr 15 '19 at 05:02
3

Every developer should have some understanding of Amdahl's law in order to estimate how much multiprocessing can speed things up under given conditions.

Amdahl's law is a model for the relationship between the expected speedup of parallelized implementations of an algorithm relative to the serial algorithm, under the assumption that the problem size remains the same when parallelized.
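As a rough illustration (not part of the original answer), the law is usually written as speedup = 1 / ((1 - p) + p / n), where p is the fraction of the run time that can be parallelized and n is the number of workers. Below is a minimal Java sketch of that formula. The class name AmdahlSketch and the 90% parallel fraction are made up for the example, and the formula applies equally whether the workers are threads or processes.

```java
/** Amdahl's law: predicted speedup for a fixed-size problem (illustrative sketch). */
final class AmdahlSketch {

    /**
     * @param parallelFraction share of the single-worker run time that can be parallelized (0.0 to 1.0)
     * @param workers          number of threads or processes doing the parallel part
     * @return predicted speedup relative to a single worker
     */
    static double speedup(double parallelFraction, int workers) {
        if (parallelFraction < 0.0 || parallelFraction > 1.0 || workers < 1) {
            throw new IllegalArgumentException("need 0 <= parallelFraction <= 1 and workers >= 1");
        }
        double serialFraction = 1.0 - parallelFraction;
        return 1.0 / (serialFraction + parallelFraction / workers);
    }

    public static void main(String[] args) {
        // Hypothetical workload: 90% of the run time parallelizes.
        for (int n : new int[] {1, 2, 4, 7}) {
            System.out.printf("workers=%d speedup=%.2f%n", n, speedup(0.9, n));
        }
    }
}
```

Under that assumed 90% figure, the serial 10% caps the speedup at 10x however many workers are added.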

This is a good read: Amdahl's law

java_mouse
  • A bit orthogonal to the question, since you can minimize the serial component of your algorithm using threads *or* processes, but worth consideration. – ObscureRobot Nov 03 '11 at 21:40
  • Thanks but honestly... this is so far away from my question. I am asking specifically for recommendations about how to implement multiprocessing in Java. Not about general laws on this topic, really! – seinecle Nov 03 '11 at 21:42
  • You can refer this as well. http://mpc.uci.edu/wget/www.tc.cornell.edu/Services/Edu/Topics/ParProgCons/index.html#sec6 – java_mouse Nov 03 '11 at 21:44
  • the implementation part is mentioned in http://download.oracle.com/javase/7/docs/api/java/lang/ProcessBuilder.html the application of reading 7 csv files is pretty rudimentary and multithreaded programs are surely more than sufficient. When we are dealing with an enterprise level application.. even there we prefer multithreaded application bcos they are light weight. For understanding multi processing in java I had found a paper earlier – Raveesh Sharma Nov 03 '11 at 21:54
  • http://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=4&sqi=2&ved=0CDwQFjAD&url=http%3A%2F%2Fciteseerx.ist.psu.edu%2Fviewdoc%2Fdownload%3Fdoi%3D10.1.1.114.4925%26rep%3Drep1%26type%3Dpdf&ei=zwyzTs-eHM3bsgaC_eiNBA&usg=AFQjCNHn_ROSQasaZyuDEMKADP4JEIddlQ&sig2=_3K3gWjXSXRkIRkc8JEPTw – Raveesh Sharma Nov 03 '11 at 21:54
  • java_mouse - sorry but your ref is completely off topic - much too general! I am looking for "how to do multiprocessing in java, and what speed gains to expect?". – seinecle Nov 03 '11 at 21:54
  • @RaveeshSharma thx a lot!! This paper is my first lead in the direction of an answer! (btw it is not about reading just 7 csv files, but 60. I plan to scale up to many more files / bigger data volumes in general - hence my enquiry about the benefits of multiprocessing.) – seinecle Nov 03 '11 at 21:57
  • -1 for being off topic. The OP mentions that they are already seeing an improvement with multithreading. – Tim Bender Nov 03 '11 at 22:09
2

For many use cases, multithreading has less overhead than multiprocessing when comparing spawning a thread vs spawning a process as well as comparing communication between threads vs inter-process communication.

However, there are scenarios where multithreading can degrade performance to the point where a single thread outperforms multiple threads, such as cases severely affected by false sharing. With multiprocessing, since each process has its own memory space there is no chance for false sharing to occur and the multiprocessing solution can outperform the multithreading solution.
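To make false sharing concrete, here is a small sketch (an illustration, not a rigorous benchmark). Two threads each increment their own slot of a shared AtomicLongArray, first with adjacent slots that probably sit on the same cache line, then with slots 128 bytes apart. The class name FalseSharingSketch, the slot indices, and the 64-byte cache-line assumption are all mine; actual timings depend on hardware and JIT warm-up, so treat any difference as indicative only.

```java
import java.util.concurrent.atomic.AtomicLongArray;

final class FalseSharingSketch {

    /** Two threads each increment their own slot; returns the elapsed time in nanoseconds. */
    static long timeIncrements(AtomicLongArray slots, int indexA, int indexB, long iterations) {
        Thread a = new Thread(() -> {
            for (long i = 0; i < iterations; i++) {
                slots.incrementAndGet(indexA);
            }
        });
        Thread b = new Thread(() -> {
            for (long i = 0; i < iterations; i++) {
                slots.incrementAndGet(indexB);
            }
        });
        long start = System.nanoTime();
        a.start();
        b.start();
        try {
            a.join();
            b.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for counter threads", e);
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        long n = 20_000_000L;
        AtomicLongArray slots = new AtomicLongArray(32);
        // Slots 0 and 1 are 8 bytes apart, so they likely share a cache line.
        long adjacent = timeIncrements(slots, 0, 1, n);
        // Slots 8 and 24 are 128 bytes apart, so they should land on different lines
        // (assuming 64-byte cache lines).
        long spaced = timeIncrements(slots, 8, 24, n);
        System.out.printf("adjacent slots: %d ms%n", adjacent / 1_000_000);
        System.out.printf("spaced slots:   %d ms%n", spaced / 1_000_000);
    }
}
```

The threads never touch each other's counter, so any slowdown in the adjacent case comes from the cache line bouncing between cores rather than from logical contention.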

Overall, some analysis should be conducted when choosing a concurrent programming solution since the best performing solution can vary on a case-to-case basis. Multithreading cannot be assumed to outperform multiprocessing since there are counterintuitive situations where multithreading performs worse than a single thread. When performance is a major consideration, run benchmarks to compare single thread single process vs multithreading vs multiprocessing solutions to ensure you are truly gaining the performance benefits that are expected.

On a quick note, there are other considerations besides performance when choosing a solution.

Hazok
1

The gain is determined by how long it takes to map/reduce the data.

If, for example, the files are loaded on multiple machines to begin with (think of it like sharding the file system), there's no lag getting the data. If the data is coming from a single location, you're limited by that mechanism.

Then the data has to be combined/aggregated; without knowing more, it's impossible to guess how expensive that is. If all processing depends on having all the data, it's a higher hit than if the ultimate results can be calculated independently.

You have a very small number of very small files: unless what you're doing is computationally expensive, I doubt it'd be worth the effort, but it's difficult to say. Assuming no network/disk bottlenecks you'll get a (very) roughly linear speedup with a delta for aggregating results. The true speedup/delta depends on a bunch of factors we don't know much about at this point.

OTOH, you could set up a small Hadoop setup and just try it and see what happens.

Dave Newton
0

Check the docs on your JVM to see if it supports multithreading. I'm pretty sure the sun ones do. Java Concurrency In Practice is the place to start for multithreading.

The first part of your question is: is multiprocessing superior to multithreading, from a performance perspective? In a system with robust multithreading support, threads should always be superior to processes, from a performance perspective. There is more isolation between processes (no shared memory, unless explicitly set up via an IPC mechanism), so you might want to go the multiprocess route to keep dangerous threads from stepping on each other.

For data processing, threads should be the best way to go. If threads on your local machine aren't enough, I would skip past a multiprocess solution and go straight to a map-reduce system like Hadoop.
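For reference, the thread-pool approach the question describes might look roughly like the sketch below. The names CsvPoolSketch, processFile and processAll are hypothetical, and counting lines stands in for whatever per-line work is actually done; it is a sketch under those assumptions, not the asker's code.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.Stream;

final class CsvPoolSketch {

    /** Placeholder per-file work: counts the lines of one file (read as UTF-8). */
    static long processFile(Path csv) {
        try (Stream<String> lines = Files.lines(csv)) {
            return lines.count();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Submits one task per file to a fixed pool and collects the results in input order. */
    static Map<Path, Long> processAll(List<Path> files, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Long>> pending = new ArrayList<>();
            for (Path file : files) {
                Callable<Long> task = () -> processFile(file);
                pending.add(pool.submit(task));
            }
            Map<Path, Long> results = new LinkedHashMap<>();
            for (int i = 0; i < files.size(); i++) {
                results.put(files.get(i), pending.get(i).get()); // blocks until that file is done
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a worker failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Usage: java CsvPoolSketch file1.csv file2.csv ...
        List<Path> files = new ArrayList<>();
        for (String arg : args) {
            files.add(Paths.get(arg));
        }
        processAll(files, 7).forEach((path, count) -> System.out.println(path + ": " + count + " lines"));
    }
}
```

All tasks share one JVM, so results come back as ordinary objects with no serialization or inter-process plumbing.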

As to why multiprocess apps are mentioned, I think the author wants to be complete. Although a tutorial is not provided, a link to additional documentation is. The big disadvantage of using multiprocessing is that you have to deal with inter process communication. Unlike threads, you can't just share some memory and throw some mutexes around it and call it a day.


From the comments, it appears that there is some confusion about what "multiprocessing" actually is. Threads are constructs that must be created by your code. There are APIs for thread creation and management. Processes, though, can be created by hand on the command line. On a unix box do the following to run four instances (processes) of foo. Note that the final & is required.

$ ./foo & ./foo & ./foo & ./foo &

Now, if you have an input file, bar, that foo needs to process, use something like split to break it up into four equal segments, and run foo on each of them:

$ ./foo bar.0 > bar.0.out & ./foo bar.1 > bar.1.out & ./foo bar.2 > bar.2.out & ./foo bar.3 > bar.3.out &

Finally, you will need to combine the bar.?.out files. Running a test like this should give you some feel for whether using heavy-weight processes is a good idea for your application. If you have already built a multi-threaded application, that will probably be just fine. But feel free to run some experiments to see if processes work better. Once you are sure that processes are the way to go, reorganize your code to use ProcessBuilder to spin up the processes yourself.
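A minimal sketch of that last step, assuming Java 9 or later (for Redirect.DISCARD): start several child processes with ProcessBuilder and wait for all of them. The class SpawnWorkersSketch and its methods are hypothetical names, and `java -version` stands in for the real worker command (for example `./foo bar.0`), which you would substitute yourself.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SpawnWorkersSketch {

    /** Path of the java launcher for the current JVM, built from the standard java.home property. */
    static String javaLauncher() {
        return System.getProperty("java.home") + File.separator + "bin" + File.separator + "java";
    }

    /** Starts one process per command, waits for all of them, and returns exit codes in order. */
    static List<Integer> runAll(List<List<String>> commands) {
        List<Process> running = new ArrayList<>();
        try {
            for (List<String> command : commands) {
                ProcessBuilder pb = new ProcessBuilder(command);
                pb.redirectErrorStream(true);
                // Discard output so a chatty child cannot block on a full pipe (Java 9+).
                pb.redirectOutput(ProcessBuilder.Redirect.DISCARD);
                running.add(pb.start()); // all children run concurrently from here on
            }
            List<Integer> exitCodes = new ArrayList<>();
            for (Process p : running) {
                exitCodes.add(p.waitFor());
            }
            return exitCodes;
        } catch (IOException e) {
            running.forEach(Process::destroy);
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            running.forEach(Process::destroy);
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for child processes", e);
        }
    }

    public static void main(String[] args) {
        // Stand-in workload: four child JVMs that only print their version and exit.
        List<List<String>> commands = new ArrayList<>();
        for (int i = 0; i < 4; i++) {
            commands.add(Arrays.asList(javaLauncher(), "-version"));
        }
        System.out.println("exit codes: " + runAll(commands));
    }
}
```

Each child here is a full JVM start, which is exactly the overhead the other answers warn about, so timing this against the thread-pool version is a quick way to check whether processes pay off.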

ObscureRobot
  • Thanks but this does not answer my question. I already use multithreading (it works well!), and would like to find a source or detailed explanation as to how/why/when multiprocessing would improve performance. Btw, I checked Java Concurrency in Practice: it does not evoke multiprocessing, just multithreading. – seinecle Nov 03 '11 at 21:26
  • @seinecle My guess is: unless you're running into memory/cpu limitations of a single process, and doing some seriously heavy stuff, probably rarely or never. Inter-process communication is gonna gobble up some of the performance gain, and spawning new processes is often somewhat expensive, so it'd only make sense for long-running tasks. One advantage, I guess, is stability. If one process crashes, the rest stays untouched. Google Chrome uses a separate process per tab to make sure sites ruining its day don't take down the whole browser. – G_H Nov 03 '11 at 21:33
  • In a system with robust threading (pretty much any modern Unix or Windows), multithreading is preferred to multiprocessing. The reason is that there is less overhead associated with threads, so you can more quickly spin them up and kill them. You also get shared memory, which is a nice bonus. On older systems, multiprocessing was the way to go. That is why Apache 1.x is multi-process and Apache 2.x is multithreaded, and everyone uses Apache 2 now. – ObscureRobot Nov 03 '11 at 21:33
  • The only reasons to use processes over threads I can think of are security and scalability. For reading several csv files neither is important, but then threading may even slow the file reading down.. – Voo Nov 03 '11 at 21:38
  • I tend to be very skeptical of all the comments here. Multithreading with 7 threads basically speeds up my i/o operation 7 times (minus a tiny bit of overhead...); the operation is simply: import 60 csv files, each about 5Mb or more. I know that multiprocessing would imply more overhead but it would bring speed gains as well! – seinecle Nov 03 '11 at 21:51
  • So could we stop making guesses, and does anybody know about a book or tutorial explaining multiprocessing (*not just multithreading!*) in java? – seinecle Nov 03 '11 at 21:52
  • What makes you think that multiprocessing would create any speed gains at all? You are going to have to redesign your code to work in separate processes. Then you either fork out a bunch of processes or use the [ProcessBuilder](http://download.oracle.com/javase/7/docs/api/java/lang/ProcessBuilder.html). Alternately, you could split your input file into N chunks and write some perl or shell to run N copies of your code in parallel. That will quickly tell you whether there is any gain to additional processes. – ObscureRobot Nov 03 '11 at 21:56
  • I've updated my answer again to include details on how to easily experiment with multiprocessing. It isn't the rocket science you think it is! – ObscureRobot Nov 03 '11 at 22:01
  • @seinecle Dude, relax. SO folks are very friendly and helpful. They're usually not content until a question was properly answered once it has generated some interest. Sit it out for a bit and you'll get good stuff like ObscureRobot posted. – G_H Nov 03 '11 at 22:47
  • @seinecle We aren't making guesses. If you can get another speedup through using processes you can get at least the same speedup through using threads. If you have two mechanisms, both of which do exactly the same, but one has a much higher overhead which one do you believe will be faster? Multiple processes do have their advantages over multiple threads, but it's obviously not additional speedup (assuming 64bit at least). If you claim otherwise provide some significant benchmark for that extremely surprising claim (try on linux; on windows the process creation overhead will kill you) – Voo Nov 03 '11 at 23:25