
I have a ByteArrayOutputStream that I received from performing a diff. Java's parsing of this is too slow, so I decided to try passing the parsing off to a Perl script. I'm having a little trouble getting the script to receive data from this output stream. When I run my code, the application hangs indefinitely. This is what I have so far:

public static Diff analyzeDiff(ByteArrayOutputStream baos) throws IOException {

    ProcessBuilder pb = new ProcessBuilder();
    pb.command("perl/path/perl", TEMP.getAbsolutePath());
    Process process = pb.start();
    OutputStream str = process.getOutputStream();
    baos.writeTo(str);
    str.flush();
    try {
        process.waitFor();
    } catch (InterruptedException e) {
        BufferedReader bf = new BufferedReader(new InputStreamReader(process.getInputStream()));
        String line;
        while ((line = bf.readLine()) != null) {
            System.out.println(line);
        }
    }

    return null;
}

@Test
public void testDiffParser() throws IOException {
    DiffParser.init();

    File test = new File("path/to/file/test.diff");

    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    baos.write(FileUtils.readFileToByteArray(test));
    //String output = baos.toString();
    //System.out.println(output);

    DiffParser.analyzeDiff(baos);
    //DiffParser.analyzeDiff(output);
}

And here is my Perl script:

#!/usr/bin/perl
use strict;
use warnings;

my $additions = 0;
my $deletions = 0;
my $filesChanged = 0;

my $fileAdded = 0;
my $line;

foreach $line ( <> ) {
    $_ = $line;
    chomp( $_ );
    print( $_ );
    if ( /^\-\-\-/m ) {
        $fileAdded = 1;
    } elsif ( /^\+\+\+/m && $fileAdded ) {
        $filesChanged++;
        $fileAdded = 0;
    } elsif ( /^\+/ ) {
        $additions++;
        $fileAdded = 0;
    } elsif ( /^\-/ ) {
        $deletions++;
        $fileAdded = 0;
    } else {
        $fileAdded = 0;
    }
}

print("$additions $deletions $filesChanged\n")

Is there a way to actually do what I am trying to do?

Edit: This is how I was doing it in Java:

private Diff parseDiff(final ByteArrayOutputStream baos) {

    final Diff diff = new Diff();

    int filesChanged = 0;
    int additions = 0;
    int deletions = 0;

    boolean fileAdded = false;

    final String[] lines = baos.toString().split("\n");

    for (final String line : lines) {

        if (line.startsWith("---")) {
            fileAdded = true;
        } else if (line.startsWith("+++") && fileAdded) {
            filesChanged++;
            fileAdded = false;
        } else if (line.startsWith("+")) {
            additions++;
            fileAdded = false;
        } else if (line.startsWith("-")) {
            deletions++;
            fileAdded = false;
        } else {
            fileAdded = false;
        }

    }

    diff.additions = additions;
    diff.deletions = deletions;
    diff.changedFiles = filesChanged;

    return diff;
}

Edit 2: If you want some context, you can refer to this related question.

Himself12794
  • `Java's parsing of this is too slow` Passing it off to perl wouldn't be my first choice (adds an additional dependency/point of failure) - consider posting the java parsing code as there may be ways to optimize – copeg Jul 28 '16 at 23:36
  • @copeg sure, have added that per your suggestion – Himself12794 Jul 28 '16 at 23:41
  • I think multithreading this code would be a better and faster idea than passing off to another process...? – saml Jul 28 '16 at 23:50
  • How many lines? How did you profile? How slow is this? – copeg Jul 28 '16 at 23:51
  • @copeg I added a link to something else that provides a little context. As for lines, a lot. At times, up to 100,000+ lines to parse. – Himself12794 Jul 29 '16 at 00:26
  • @Sam you might be right, Since I have little experience with multi threading, I hadn't considered that. – Himself12794 Jul 29 '16 at 00:34
  • I just saw 100,000+ lines above. The issue is the number of lines, rather than the speed Java does it. I doubt you would get a substantive benefit using perl. You need to multithread this. Which version of java are you using? – saml Jul 29 '16 at 00:36
  • @Sam I'm using 1.7 for ease of integration with existing components, but I can easily use 1.8 if I thought it would make things easier/faster – Himself12794 Jul 29 '16 at 00:47
  • No, it doesn't make too much of a difference. Java 8 does have some syntactic sugar. Take a look around the Oracle documentation and some tutorials, like this one: http://howtodoinjava.com/core-java/multi-threading/java-thread-pool-executor-example/ and raise questions if you get stuck! :) – saml Jul 29 '16 at 01:09
  • @Sam yes, I'm particularly fond of the stream utility – Himself12794 Jul 29 '16 at 01:11
  • Well, one thing you could do if you were using Java 8 is use a parallel stream, which would replace your loop and handle the multithreading for you behind the scenes in the common ForkJoinPool. This does have some disadvantages, since that thread pool is shared by the whole application, but in this case I don't think that would be a problem. – saml Jul 29 '16 at 01:13
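On the parallel-stream idea from the comments: the loop looks stateful because of `fileAdded`, but that flag is true only when the previous line started with `---`, so each line can be classified from itself and its predecessor alone. Below is a minimal Java 8 sketch under that observation. The names `ParallelDiffCount`, `count` and `isFileHeader` are made up for illustration, and whether this actually beats the plain loop on 100,000+ lines would need measuring.

```java
import java.util.stream.IntStream;

final class ParallelDiffCount {

    /** Returns {additions, deletions, filesChanged}, using the same rules as the question's loop. */
    static long[] count(String[] lines) {
        // A "+++" line counts as a changed file only when the line before it starts with "---".
        long filesChanged = IntStream.range(0, lines.length).parallel()
                .filter(i -> isFileHeader(lines, i))
                .count();
        // Every other line starting with "+" is an addition.
        long additions = IntStream.range(0, lines.length).parallel()
                .filter(i -> lines[i].startsWith("+") && !isFileHeader(lines, i))
                .count();
        // Lines starting with "-" are deletions, except "---" lines, which the original loop never counts.
        long deletions = IntStream.range(0, lines.length).parallel()
                .filter(i -> lines[i].startsWith("-") && !lines[i].startsWith("---"))
                .count();
        return new long[] { additions, deletions, filesChanged };
    }

    private static boolean isFileHeader(String[] lines, int i) {
        return i > 0 && lines[i].startsWith("+++") && lines[i - 1].startsWith("---");
    }
}
```

It could be called as `ParallelDiffCount.count(baos.toString().split("\n"))`. Note that this still builds the full string and array first, so it does nothing about memory use.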

2 Answers


I'm using a tablet at present so I can't help much, but your Perl needs some work.

You shouldn't use for $line ( <> ) as that will try to read all of the input into a list before starting to iterate. You also don't use $line so you should read straight into $_ with

while ( <> ) { ... }

There's also no need to chomp every line, and I don't understand why you call print for every record. It comes after the chomp, so the output will be a copy of the input all on one very long line, with the aggregate values at the very end.

I suspect the Perl script is receiving the data just fine, but having trouble fitting all of the input into memory at once together with a second copy of everything as output!

Borodin
  • I confess, before today, I had never written anything in Perl. This is mostly a result of copy-pasting my Java code and modifying it to execute. – Himself12794 Jul 29 '16 at 02:20
  • @Himself12794: I'm pleased to help, but if your Java code is also trying to cache the entire stream, as [Vladimír Schäfer says](http://stackoverflow.com/a/38652955/622310) then you're in trouble. I have few Java skills so I can't help at that end, but I imagine that there is a way to read a stream and pass it to your Perl child process incrementally. Perl will certainly oblige. – Borodin Jul 29 '16 at 14:40
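On the Java side, the hang in the question is likely down to how the child process is driven rather than to Perl. The child's stdin is flushed but never closed, so `<>` never sees end-of-file. The child's output is read only inside the `catch (InterruptedException e)` block, so it is normally never read at all. And because the script echoes every input line, its stdout pipe can fill up while Java is still blocked writing to stdin. Below is a minimal sketch of one way around all three, assuming Java 8 for the lambda; the names `ChildProcessPipe` and `pipeThrough` are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.List;

final class ChildProcessPipe {

    /** Runs the command, sends input to its stdin, and returns everything it printed. */
    static String pipeThrough(List<String> command, byte[] input) {
        try {
            ProcessBuilder pb = new ProcessBuilder(command);
            pb.redirectErrorStream(true); // fold stderr into stdout so neither pipe is left undrained
            Process process = pb.start();

            // Write on a separate thread so this thread is free to drain the child's output.
            Thread writer = new Thread(() -> {
                // try-with-resources closes stdin, which is what signals end-of-file to the child.
                try (OutputStream stdin = process.getOutputStream()) {
                    stdin.write(input);
                } catch (IOException e) {
                    // The child stopped reading early; there is nothing more to send.
                }
            });
            writer.start();

            // Read until the child closes its stdout, then wait for it to exit.
            ByteArrayOutputStream captured = new ByteArrayOutputStream();
            try (InputStream stdout = process.getInputStream()) {
                byte[] buffer = new byte[8192];
                int n;
                while ((n = stdout.read(buffer)) != -1) {
                    captured.write(buffer, 0, n);
                }
            }
            writer.join();
            process.waitFor();
            return new String(captured.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for the child process", e);
        }
    }
}
```

With the question's placeholders, the call would look something like `ChildProcessPipe.pipeThrough(Arrays.asList("perl/path/perl", TEMP.getAbsolutePath()), baos.toByteArray())`.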

Using ByteArrayOutputStream means that the whole result of the diff needs to be stored in memory, all at once, instead of getting processed and garbage collected in chunks. Your Java program may have been slow due to running out of memory and performing garbage collection all the time.

Java will be much faster at whatever task you throw at it, compared to Perl. It's a just-in-time compiled language, as opposed to an interpreted language in the case of Perl. See e.g. http://blog.carlesmateo.com/2014/10/13/performance-of-several-languages/, https://attractivechaos.github.io/plb/ or https://en.wikipedia.org/wiki/Java_performance (comparison to other languages).

If you need performance, you should optimize your Java code instead of creating dependency on Perl.
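As one possible way to act on the memory point above, here is a minimal sketch that applies the same counting rules as the question's `parseDiff`, but reads line by line from an `InputStream` instead of calling `baos.toString().split("\n")`. The class name `StreamingDiffStats` and its fields are made up for illustration, and it assumes the diff text is UTF-8.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class StreamingDiffStats {
    int additions;
    int deletions;
    int changedFiles;

    /** Counts additions, deletions and changed files without holding all lines in memory. */
    static StreamingDiffStats parse(InputStream in) {
        StreamingDiffStats stats = new StreamingDiffStats();
        boolean fileAdded = false; // true only right after a line starting with "---"
        try (BufferedReader reader =
                new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.startsWith("---")) {
                    fileAdded = true;
                } else if (line.startsWith("+++") && fileAdded) {
                    stats.changedFiles++;
                    fileAdded = false;
                } else if (line.startsWith("+")) {
                    stats.additions++;
                    fileAdded = false;
                } else if (line.startsWith("-")) {
                    stats.deletions++;
                    fileAdded = false;
                } else {
                    fileAdded = false;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return stats;
    }
}
```

If the diff tool can hand over an `InputStream` directly, nothing needs to be buffered up front. If the data only ever arrives as a `ByteArrayOutputStream`, wrapping it with `new ByteArrayInputStream(baos.toByteArray())` still copies the bytes once, but avoids the extra `String` and `String[]` that `toString().split("\n")` creates.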

Vladimír Schäfer
  • Unfortunately, that's the format in which I receive the data. – Himself12794 Jul 29 '16 at 12:17
  • Tribalism isn't welcome here. ***"Java will be much faster at whatever task you throw at it, compared to Perl"*** That requires a citation. I believe you are talking nonsense. The main reasons for using Java have always been its cross-platform performance and its open-source origins, and [Oracle are changing that](http://arstechnica.com/information-technology/2016/07/how-oracles-business-as-usual-is-threatening-to-kill-java/). Beyond that, its level of abstraction will make it more or less efficient than Perl, depending on the task at hand. Its demands for formal typing make it less popular. – Borodin Jul 29 '16 at 14:48
  • Check the benchmarks above and throw at it experience in processing of billions of records a day on both platforms. – Vladimír Schäfer Jul 29 '16 at 15:03
  • Take a look at [*Perl, Python, Ruby, PHP, C, C++, Lua, Tcl, JavaScript and Java comparison*](http://raid6.com.au/~onlyjob/posts/arena/), where Java comes off by far the worst for a simple string concatenation and regex substitution task. – Borodin Jul 30 '16 at 14:01
  • The Java code in that benchmark is compiling the regexp on every call, instead of using Pattern, that makes it slow. See http://stackoverflow.com/questions/6262397/string-replaceall-is-considerably-slower-than-doing-the-job-yourself – Vladimír Schäfer Jul 30 '16 at 18:13
  • It turned out that the majority of the slowdown was not coming from parsing the text, but from the diff operation itself. In this case, the difference between Perl and Java parsing is not significant enough to warrant passing it off to another process. – Himself12794 Aug 09 '16 at 16:20