8

I have a Java program that uses ProcessBuilder to call the Unix sort command. When I run this code within my IDE (IntelliJ), it takes only about a second to sort 500,000 lines. When I package it into an executable jar and run that from the terminal, it takes about 10 seconds. When I run the sort command myself from the terminal, it takes 20 seconds!

Why the vast difference in performance, and is there any way I can get the jar to execute with the same performance? The environment is OS X 10.6.8 and Java 1.6.0_26. The bottom of the sort man page says "sort 5.93 November 2004".

The command it is executing is:

sort -t'    ' -k5,5f -k4,4f -k1,1n /path/to/input/file -o /path/to/output/file

Note that when I run sort from the terminal I need to manually escape the tab delimiter and use the argument -t$'\t' instead of the actual tab (which I can pass to ProcessBuilder).

Looking at ps, everything seems the same, except that when run from the IDE the sort command has a TTY of ?? instead of ttys000, but from this question I don't think that should make a difference. Perhaps Bash is slowing me down? I am running out of ideas and want to close this 20x performance gap!

Aaron Silverman
  • wow .. I think I saw someone else ask this same question yesterday. http://stackoverflow.com/questions/7111127/why-is-my-application-running-faster-in-intellij-compared-to-command-line – Kal Aug 19 '11 at 16:23
  • 1
    Do you know that you are running the same sort? Try an absolute path to the executable to be sure. If you have brew/macports/fink installed, it is possible that the sort from one of those packages is being run when it's slower. – ergosys Aug 19 '11 at 16:46
  • @Zugwalt : how 'wide' is each record, or put another way, how big is the overall file you are sorting? 500,000 normal records in 1 second sounds right for the Unix systems I'm used to working on. 20 seconds seems insane. The sort will build its temp files in the /tmp or /var/tmp dir (unless you are overriding it with `-T`). Maybe you can pick up a clue there by watching the processing. Otherwise, I'm thinking problems with disks: is your IDE writing tmpfiles to a different place than the standard /tmp/ or /var/tmp? Good luck. – shellter Aug 19 '11 at 16:50
  • How do you know it's slower? Did you take into account JVM startup/shutdown time? Maybe your console is blocking writes as it catches up drawing on the screen and, therefore, your app is mainly waiting for I/O? – Kaleb Pederson Aug 19 '11 at 16:52
  • @Kal -- good memory! He works on my team and has passed the issue on to me. I dived in more and wanted to present my findings and the issue in a less code heavy question. – Aaron Silverman Aug 19 '11 at 17:05
  • @ergosys, I modified it to use the absolute /usr/bin/sort with same results. – Aaron Silverman Aug 19 '11 at 17:05
  • @shellter each row is about 350 characters wide. I will check whether different temp directories are being used, but the machine has an SSD. – Aaron Silverman Aug 19 '11 at 17:06
  • @Kaleb I put in outputs before and after the sort and for the terminal I used the time command--I doubt it would buffer for 10 extra seconds. Thanks everybody for the comments so far! – Aaron Silverman Aug 19 '11 at 17:06
  • You could also try calling `bash execsort.sh` from java. – toto2 Aug 19 '11 at 17:20
  • @Zugwalt Console2 on windows has buffering/display problems for bulk output. A quick test would be to write the output to file instead of console. The time command includes JVM startup and shutdown time. Modifying your program slightly to record actual start and end time would be more accurate. – Kaleb Pederson Aug 19 '11 at 17:46
  • @Kaleb I did modify the program to record actual start and end times, I only used the time command when not using the JVM and running the command myself – Aaron Silverman Aug 19 '11 at 18:23
  • @toto2 I have tried passing sort as the command for /bin/sh to execute using the -c argument as well as running it directly with no effect. – Aaron Silverman Aug 19 '11 at 18:25
  • @Zugwalt Your answer is not clear. You tried running from the shell with "/bin/sh /bin/sort" and it was very slow? And you tried from Java code (IntelliJ) calling "/bin/sh /bin/sort" and it was very fast? – toto2 Aug 19 '11 at 20:18
  • @Zugwalt : Did you get this resolved? We'd be very interested to find out your solution. Good luck. – shellter Aug 22 '11 at 15:11
  • @shellter not resolved yet--time for a bounty! – Aaron Silverman Aug 22 '11 at 15:34
  • I'm confused by your post. For the cmd-line problem, are you using `sort -t' '` (with tab char) OR `... -t$'\t' ...`? Did you try `$"\t"`? Have you tried process of elimination testing, by sorting same file on other machines? Sorting different files on same machine, Making sure there are no high CPU processes running? Good luck. – shellter Aug 22 '11 at 15:45

2 Answers

12

I'm going to venture two guesses:

  • perhaps you are invoking different versions of sort (run `which sort` and use the full absolute path to compare again?)

  • perhaps you are using more complicated locale settings (leading to more complicated character set handling etc.)? Try

     export LANG=C
     sort -t'    ' -k5,5f -k4,4f -k1,1n /input/file -o /output/file
    

to compare
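
From Java, the same comparison can be made by setting the variable on the child process's environment before starting it. The following is a minimal sketch, not a definitive implementation: the class name `SortLauncher` and the method `buildSort` are hypothetical, and `/usr/bin/sort` is assumed to be the binary in use. Passing `"-t\t"` as a single argument hands sort a literal tab without any shell quoting.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

public class SortLauncher {

    // Builds the sort invocation from the question with LANG forced to "C".
    // Hypothetical helper; the caller supplies the input and output paths.
    public static ProcessBuilder buildSort(String inputPath, String outputPath) {
        List<String> command = Arrays.asList(
                "/usr/bin/sort",          // assumed location of sort
                "-t\t",                   // literal tab as the field delimiter
                "-k5,5f", "-k4,4f", "-k1,1n",
                inputPath,
                "-o", outputPath);
        ProcessBuilder pb = new ProcessBuilder(command);
        // Only the child's environment is changed, not the JVM's own.
        // Note: if LC_ALL or LC_COLLATE is set in the inherited environment,
        // it takes precedence over LANG.
        pb.environment().put("LANG", "C");
        return pb;
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        if (args.length != 2) {
            System.err.println("usage: SortLauncher <input> <output>");
            return;
        }
        long start = System.nanoTime();
        // inheritIO() forwards sort's stdout/stderr to this process (Java 7+).
        int exit = buildSort(args[0], args[1]).inheritIO().start().waitFor();
        long millis = (System.nanoTime() - start) / 1000000L;
        System.out.println("sort exited with " + exit + " after " + millis + " ms");
    }
}
```

Timing the child process inside the program, as `main` does here, keeps JVM startup and shutdown out of the measurement.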

sehe
  • 2
    Your second guess was it. After calling processBuilder.environment().put("LANG","C") before processBuilder.start(), the jar run from the shell (which was defaulting to en_US.UTF-8) performed the same as the run from the IDE. This article also gives some numbers showing en_US.UTF-8 being almost 10x as slow as C and 5x as slow as en_US: http://computing.fnal.gov/unix-users/tips/Lang_Tips.html – Aaron Silverman Aug 22 '11 at 18:10
  • 1
    Bravo @sehe! Learned something valuable here. Also an upvote to @Zugwalt for including the link to Lang_Tips.html. Excellent! – shellter Aug 22 '11 at 22:40
  • Thanks all for your support! It's nice to get so lucky in guessing the cause sometimes :) – sehe Aug 23 '11 at 07:27
  • Who would have thought it was due to environment stuff :) – Chris Dennett Aug 30 '11 at 16:13
0

Have a look at this project: http://code.google.com/p/externalsortinginjava/

It avoids the need to call an external sort entirely.
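
The linked project's API is not reproduced here. As a separate illustration, and assuming the whole file fits in memory, the key order from the question (`-k5,5f -k4,4f -k1,1n`) could be approximated with a plain comparator. The class `InMemoryTabSort` and its members are hypothetical names, field 1 is assumed to hold integers, and the ordering is only meant to resemble what sort does in the C locale for ASCII data.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

public class InMemoryTabSort {

    /** One input line plus its sort keys, extracted once up front. */
    static final class Row {
        final String line;
        final String key5; // field 5, upper-cased (approximates the "f" flag)
        final String key4; // field 4, upper-cased
        final long key1;   // field 1 as an integer (assumption: no decimals)

        Row(String line) {
            String[] f = line.split("\t", -1);
            this.line = line;
            this.key5 = f.length > 4 ? f[4].toUpperCase(Locale.ROOT) : "";
            this.key4 = f.length > 3 ? f[3].toUpperCase(Locale.ROOT) : "";
            this.key1 = parseLongOrZero(f[0]);
        }
    }

    // Non-numeric values count as 0, loosely mirroring sort's -n behavior.
    static long parseLongOrZero(String s) {
        try {
            return Long.parseLong(s.trim());
        } catch (NumberFormatException e) {
            return 0L;
        }
    }

    static final Comparator<Row> ORDER = new Comparator<Row>() {
        @Override
        public int compare(Row a, Row b) {
            int c = a.key5.compareTo(b.key5);
            if (c != 0) return c;
            c = a.key4.compareTo(b.key4);
            if (c != 0) return c;
            c = Long.compare(a.key1, b.key1);
            if (c != 0) return c;
            // Fall back to the whole line when all keys tie.
            return a.line.compareTo(b.line);
        }
    };

    /** Returns a new list with the lines ordered by fields 5, 4, then 1. */
    public static List<String> sortLines(List<String> lines) {
        List<Row> rows = new ArrayList<Row>(lines.size());
        for (String l : lines) rows.add(new Row(l));
        Collections.sort(rows, ORDER);
        List<String> out = new ArrayList<String>(rows.size());
        for (Row r : rows) out.add(r.line);
        return out;
    }
}
```

Because `String.compareTo` compares UTF-16 code units, it involves no locale-aware collation, which is the same property that makes `LANG=C` fast for the external command.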

Chris Dennett