
I have written Perl code to process a huge number of CSV files and produce output; it takes 0.8326 seconds to complete.

my $opname = $ARGV[0];
my @files = `find . -name "*${opname}*.csv" -mtime -10 -type f`;
my %hash;
foreach my $file (@files) {
    chomp $file;
    my $time = $file;
    $time =~ s/.*\~(.*?)\..*/$1/;

    open(IN, $file) or print "Can't open $file\n";
    while (<IN>) {
        my $line = $_;
        chomp $line;

        my $severity = (split(",", $line))[6];
        next if $severity =~ m/NORMAL/i;
        $hash{$time}{$severity}++;
    }
    close(IN);
}
foreach my $time (sort {$b <=> $a} keys %hash) {
    foreach my $severity ( keys %{$hash{$time}} ) {
        print $time . ',' . $severity . ',' . $hash{$time}{$severity} . "\n";
    }
}

Now I have written the same logic in Java, but it takes 2600 ms, i.e. 2.6 s, to complete. My question is: why does Java take so long, and how can I achieve the same speed as Perl? Note: I ignored the JVM initialization and class-loading time.

    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileFilter;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.TreeMap;

    public class MonitoringFileReader {
        static Map<String, Map<String,Integer>> store= new TreeMap<String, Map<String,Integer>>(); 
        static String opname;
        public static void testRead(String filepath) throws IOException
        {
            File file = new File(filepath);

            FileFilter fileFilter= new FileFilter() {

                @Override
                public boolean accept(File pathname) {
                    // keep CSV files whose name contains opname, modified within the last 10 days (86400000 ms = 1 day)
                    int timediffinhr=(int) ((System.currentTimeMillis()-pathname.lastModified())/86400000);
                    if(timediffinhr<10 && pathname.getName().endsWith(".csv")&& pathname.getName().contains(opname)){
                        return true;
                        }
                    else
                        return false;
                }
            };

            File[] listoffiles = file.listFiles(fileFilter);
            long time = System.currentTimeMillis();
            for(File mf:listoffiles){
                String timestamp=mf.getName().split("~")[5].replace(".csv", "");
                BufferedReader br= new BufferedReader(new FileReader(mf),1024*500);
                String line;
                Map<String,Integer> tmp=store.containsKey(timestamp)?store.get(timestamp):new HashMap<String, Integer>();
                while((line=br.readLine())!=null)
                {
                    String severity=line.split(",")[6];
                    if(!severity.equals("NORMAL"))
                    {
                        tmp.put(severity, tmp.containsKey(severity)?tmp.get(severity)+1:1);
                    }
                }
                store.put(timestamp, tmp);
            }
            time = System.currentTimeMillis() - time;
            System.out.println(time+"ms");  
            System.out.println(store);


        }

        public static void main(String[] args) throws IOException
        {
            opname = args[0];
            long time= System.currentTimeMillis();
            testRead("./SMF/data/analyser/archive");
            time=System.currentTimeMillis()-time;
            System.out.println(time+"ms");
        }

    }

File name format: `A~B~C~D~E~20150715080000.csv`; around 500 files of ~1 MB each. Sample content:

A,B,C,D,E,F,CRITICAL,G
A,B,C,D,E,F,NORMAL,G
A,B,C,D,E,F,INFO,G
A,B,C,D,E,F,MEDIUM,G
A,B,C,D,E,F,CRITICAL,G

Java Version: 1.7

////////////////////Update///////////////////

As per the comments below, I replaced the split with a precompiled regex, and the performance improved a lot. Now I am running the logic in a loop, and after 3-10 iterations the performance is quite acceptable.

    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileFilter;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class MonitoringFileReader {
        static Map<String, Map<String,Integer>> store= new HashMap<String, Map<String,Integer>>(); 
        static String opname="Etis_Egypt";
        static Pattern pattern1=Pattern.compile("(\\d+\\.)");
        static Pattern pattern2=Pattern.compile("(?:\"([^\"]*)\"|([^,]*))(?:[,])");
        static long currentsystime=System.currentTimeMillis();
        public static void testRead(String filepath) throws IOException
        {
            File file = new File(filepath);

            FileFilter fileFilter= new FileFilter() {

                @Override
                public boolean accept(File pathname) {
                    // same 10-day, opname, .csv filter as before
                    int timediffinhr=(int) ((currentsystime-pathname.lastModified())/86400000);
                    if(timediffinhr<10 && pathname.getName().endsWith(".csv")&& pathname.getName().contains(opname)){
                        return true;
                        }
                    else
                        return false;
                }
            };

            File[] listoffiles= file.listFiles(fileFilter);
            long time = System.currentTimeMillis();
            for(File mf:listoffiles){
                Matcher matcher=pattern1.matcher(mf.getName());
                matcher.find();
                //String timestamp=mf.getName().split("~")[5].replace(".csv", "");
                String timestamp=matcher.group();
                BufferedReader br= new BufferedReader(new FileReader(mf));
                String line;
                Map<String,Integer> tmp=store.containsKey(timestamp)?store.get(timestamp):new HashMap<String, Integer>();
                while((line=br.readLine())!=null)
                {
                    matcher = pattern2.matcher(line);
                    // advance to the 7th comma-separated field
                    for (int i = 0; i < 7; i++) matcher.find();
                    //String severity=line.split(",")[6];
                    // group(1) holds a quoted field, group(2) an unquoted one;
                    // plain group() would include the trailing comma
                    String severity = matcher.group(1) != null ? matcher.group(1) : matcher.group(2);
                    if(!severity.equals("NORMAL"))
                    {
                        tmp.put(severity, tmp.containsKey(severity)?tmp.get(severity)+1:1);
                    }
                }
                br.close();
                store.put(timestamp, tmp);
            }
            time = System.currentTimeMillis() - time;
            //System.out.println(time+"ms");    
            //System.out.println(store);


        }

        public static void main(String[] args) throws IOException
        {
            //opname = args[0];
            for (int i = 0; i < 20; i++) {
                long time = System.currentTimeMillis();
                testRead("./SMF/data/analyser/archive");
                time = System.currentTimeMillis() - time;

                System.out.println("Time taken for " + i + " is " + time + "ms");
            }
        }

    }

But I have another question now.

Here are the results while running on a small dataset:

    Time taken for 0 is 218ms
    Time taken for 1 is 134ms
    Time taken for 2 is 127ms
    Time taken for 3 is 98ms
    Time taken for 4 is 90ms
    Time taken for 5 is 77ms
    Time taken for 6 is 71ms
    Time taken for 7 is 72ms
    Time taken for 8 is 62ms
    Time taken for 9 is 57ms
    Time taken for 10 is 53ms
    Time taken for 11 is 58ms
    Time taken for 12 is 59ms
    Time taken for 13 is 46ms
    Time taken for 14 is 44ms
    Time taken for 15 is 45ms
    Time taken for 16 is 53ms
    Time taken for 17 is 45ms
    Time taken for 18 is 61ms
    Time taken for 19 is 42ms

For the first few iterations the time taken is higher, and then it drops. Why???

Thanks,

  • **Note: The app is performance-critical.** Hence 2.6ms matters. – RBanerjee Jul 15 '15 at 09:26
  • You should really use a library for parsing CSV. I can't say anything about performance but I recommend [OpenCSV](http://opencsv.sourceforge.net). – Kai Jul 15 '15 at 09:28
  • Same goes for Perl. https://metacpan.org/pod/Text::CSV will be much safer than your own implementation. – simbabque Jul 15 '15 at 09:34
  • Perl is basically a text-processing language; it was developed with text processing in mind. – Raghavendra Jul 15 '15 at 09:38
  • There is a lot you can do to make that Perl code go faster! – Borodin Jul 15 '15 at 09:50
  • The first thing you want to do is use a `Pattern` instead of calling `.split(",")` on `String` instances; given how you currently do it, a `Pattern` is created for each processed line of the CSV – fge Jul 15 '15 at 09:58
  • I can't use OpenCSV, because the input will not always be CSV; it could even be XML. But processing should complete within milliseconds. The logic is not complex, but the IO will be huge. Why is Java so slow??? – RBanerjee Jul 15 '15 at 10:51
  • [`Text::CSV` may be safer but it is probably slower than your existing impl](http://stackoverflow.com/questions/13916962) – mob Jul 15 '15 at 13:35
  • These 2.6 ms, is this the time you measured? How do you get 2.6 when dealing with integers? How do you ignore class loading time, when classes get loaded as needed? – maaartinus Jul 15 '15 at 15:36
  • @maaartinus, my mistake: for Perl it's 0.8 s and for Java it's 2600 ms, i.e. 2.6 s. I corrected that. I used only one class, and System.currentTimeMillis() measures the time taken by the actual logic (which is the same for Java and Perl) within that class; please have a look at my Java code. So the VM is already initialized and the class is loaded before System.currentTimeMillis() is called. Hope that clarifies your question. – RBanerjee Jul 15 '15 at 16:14
  • OK, this makes more sense. See my answer. Your Java is far from nice; if you're interested in making the code better, post it on [code review](http://codereview.stackexchange.com/questions/tagged/java). Please clean it up a bit first. – maaartinus Jul 15 '15 at 19:20

2 Answers

4

A few seconds are not enough for Java to get to its full speed because of JIT compilation. Java is optimized for servers running for hours (or years), not for tiny utilities taking just a few seconds.
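A minimal sketch of the effect; `WarmupDemo` and `doWork` are hypothetical stand-ins for your parsing loop:

    public class WarmupDemo {
        // stand-in for the real parsing work
        static long doWork() {
            long sum = 0;
            for (int i = 0; i < 5000000; i++) {
                sum += Integer.parseInt(Integer.toString(i & 0xFF));
            }
            return sum;
        }

        public static void main(String[] args) {
            // the early iterations run interpreted or lightly compiled;
            // later ones use fully JIT-compiled code and run faster
            for (int i = 0; i < 10; i++) {
                long t = System.currentTimeMillis();
                doWork();
                System.out.println("iteration " + i + ": "
                        + (System.currentTimeMillis() - t) + "ms");
            }
        }
    }

On a typical JVM the first iterations print noticeably higher times than the later ones, which is exactly the pattern you observed.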

Concerning class loading: I guess you don't know about, e.g., `Pattern` and `Matcher`, which you use indirectly in `split` and which get loaded as needed.
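You can watch this happen with the standard `-verbose:class` JVM flag; classes such as `java.util.regex.Pattern` only appear in the log at their first use (the argument placeholder is illustrative):

    java -verbose:class MonitoringFileReader <opname>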


static Map<String, Map<String,Integer>> store= new TreeMap<String, Map<String,Integer>>(); 

A Perl hash is most like a Java `HashMap`, but you're using a `TreeMap`, which is slower. I guess this doesn't matter much; just note that there are far more differences than you might think.
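A sketch of the closer equivalent: accumulate into a `HashMap` and sort the keys only once, at output time, as the Perl does. This assumes the timestamps are fixed-width, so reverse string order matches the numeric `sort {$b <=> $a}`; it needs `java.util.{ArrayList,Collections,HashMap,List,Map}`:

    // accumulate into a HashMap (like a Perl hash) ...
    Map<String, Map<String, Integer>> store = new HashMap<String, Map<String, Integer>>();
    // ... fill store while reading the files ...

    // ... and sort the keys only once, when printing
    List<String> times = new ArrayList<String>(store.keySet());
    Collections.sort(times, Collections.reverseOrder());
    for (String time : times) {
        for (Map.Entry<String, Integer> e : store.get(time).entrySet()) {
            System.out.println(time + "," + e.getKey() + "," + e.getValue());
        }
    }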


 int timediffinhr=(int) ((System.currentTimeMillis()-pathname.lastModified())/86400000);

You're reading the modification time for each file again and again. You're doing it even for files whose names don't end with ".csv". That's surely not what `find` does.
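A sketch of a cheaper filter, doing the string tests first and reading the modification time only when they pass (`opname` as in your code; 86400000 is milliseconds per day):

    FileFilter fileFilter = new FileFilter() {
        private final long now = System.currentTimeMillis(); // read the clock once

        @Override
        public boolean accept(File pathname) {
            String name = pathname.getName();
            // cheap string checks first ...
            if (!name.endsWith(".csv") || !name.contains(opname)) {
                return false;
            }
            // ... stat the file only when they pass, like find -mtime -10
            long ageInDays = (now - pathname.lastModified()) / 86400000L;
            return ageInDays < 10;
        }
    };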


String timestamp=mf.getName().split("~")[5].replace(".csv", "");

Unlike Perl, Java doesn't cache regexes. As far as I know, a split on a single character gets optimized separately, but otherwise you'd be much better off using something like

private static final Pattern FILENAME_PATTERN =
    Pattern.compile("(?:[^~]*~){5}([^~]*)\\.csv");

Matcher m = FILENAME_PATTERN.matcher(mf.getName());
if (!m.matches()) { /* ... do what you want with unexpected names */ }
String timestamp = m.group(1);

 BufferedReader br = new BufferedReader(new FileReader(mf), 1024*500);

This could be the culprit. By default, it uses platform encoding, which may be UTF-8. This is usually slower than ASCII or LATIN-1. As far as I know Perl works directly with bytes unless instructed otherwise.

The buffer size of half a megabyte is insanely big for anything taking just a few seconds, especially when you allocate it multiple times. Note that there's nothing like this in your Perl code.
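A sketch of both fixes, assuming the data is plain ASCII; this needs `java.io.FileInputStream`, `java.io.InputStreamReader`, and `java.nio.charset.StandardCharsets` (the latter available since Java 7):

    // Decode with a fixed single-byte charset instead of the platform default,
    // and let BufferedReader use its default buffer size.
    BufferedReader br = new BufferedReader(
            new InputStreamReader(new FileInputStream(mf), StandardCharsets.ISO_8859_1));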


That all said, Perl and `find` might indeed be faster for such short tasks.

  • You make some good points. I would add that the OP has implemented the Perl code exactly, including making `store` (`%hash` in Perl) a hash of hashes. The corresponding Java TreeMap of HashMaps is far less obvious and presumably much slower than the Perl original. Perl buffers input streams in an 8KB buffer by default, but its `split` doesn't use the regex engine. I don't know how `FileFilter` works, but I believe Perl can emulate the filter quicker than shelling out to `find`. In the end I think it is down to the OP to present what he has. I haven't seen a good case for a rewrite – Borodin Jul 15 '15 at 19:38
  • Thanks for your answer, it makes sense. I would like to add a few things: 1. I got the class-loading part; classes get loaded when they are required. 2. On the TreeMap part: in Perl I am also sorting the map, hence the TreeMap in Java, and with a HashMap there is no improvement either. – RBanerjee Jul 15 '15 at 20:08
  • 3. The split-internal-regex part is new to me. 4. Initially I used the default buffer size for the reader, but then I thought the performance might be slow due to more disk IO, hence I increased it. But nothing changed in terms of timings. – RBanerjee Jul 15 '15 at 20:09
  • 5. Inside the testRead method I added timings after the file filter, which show that the main time is taken by the while loop. @maaartinus, you also said Java is good for long runs; this job will run forever, so my question is: if it runs forever, will the performance improve from the 2nd or 3rd run onwards? – RBanerjee Jul 15 '15 at 20:13
  • Also the readers aren't being closed, so there are 500 readers out there taking up memory (see the try-with-resources sketch after this thread). – dkatzel Jul 15 '15 at 20:21
  • @user3080158 I'd bet it'll get better when iterating multiple times. A typical benchmark makes 5-20 throw-away iterations before it starts measuring the time. The question is how good it gets. Without closing the readers, you'll run out of file descriptors soon. +++ If you could provide the data, someone could try harder to optimize. – maaartinus Jul 15 '15 at 22:04
  • @maaartinus, thanks, it worked!! See the results while running on a small dataset: **Time taken for 0 is 218ms, Time taken for 1 is 134ms, Time taken for 2 is 127ms,** Time taken for 3 is 98ms, Time taken for 4 is 90ms, Time taken for 5 is 77ms, Time taken for 6 is 71ms, Time taken for 7 is 72ms, Time taken for 8 is 62ms, Time taken for 9 is 57ms, Time taken for 10 is 53ms, Time taken for 11 is 58ms, Time taken for 12 is 59ms, Time taken for 13 is 46ms, Time taken for 14 is 44ms, Time taken for 15 is 45ms. **For the first few iterations the time taken is higher, and then it drops. Why???** – RBanerjee Jul 16 '15 at 10:00
  • @RBanerjee That's JIT compilation. At first, the code gets interpreted and stats get collected. Concurrently, a simple compiler (C1) runs and produces some medium-quality code, which gets used when ready. Then a better compiler (C2) runs to produce highly optimized code. And this all holds for each relevant part of the code (parts executed just a few times usually need no compilation). It's actually a bit more complicated (google OSR or deoptimization). – maaartinus Jul 16 '15 at 12:24
  • @maaartinus, I got the answer. A lot of learning for me. Thanks!! – RBanerjee Jul 16 '15 at 15:25
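Regarding the unclosed readers mentioned in the thread above, a minimal try-with-resources sketch (available since Java 7); `processFile` is a hypothetical helper:

    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileReader;
    import java.io.IOException;

    class ReaderCleanup {
        // try-with-resources closes the reader even if readLine() throws
        static void processFile(File mf) throws IOException {
            try (BufferedReader br = new BufferedReader(new FileReader(mf))) {
                String line;
                while ((line = br.readLine()) != null) {
                    // ... count severities as before ...
                }
            }
        }
    }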
0

One obvious thing: the use of split() will slow you down. According to the JDK source code I can find online, Java does not cache compiled regexps (please correct me if I am wrong).

Make sure you use pre-compiled regexps in your Java code.
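A minimal sketch of what that looks like (class and method names are illustrative):

    import java.util.regex.Pattern;

    class SeverityParser {
        // compiled once and reused for every line, instead of passing the
        // string "," to String.split() on each call
        private static final Pattern COMMA = Pattern.compile(",");

        static String severityOf(String line) {
            return COMMA.split(line)[6]; // 7th field
        }
    }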