I have written a Perl script to process a huge number of CSV files and produce output; it takes 0.8326 seconds to complete.
my $opname = $ARGV[0];
# Find CSV files matching the operation name, modified within the last 10 days
my @files = `find . -name "*${opname}*.csv" -mtime -10 -type f`;
my %hash;
foreach my $file (@files) {
    chomp $file;
    my $time = $file;
    $time =~ s/.*\~(.*?)\..*/$1/;    # keep the timestamp between the last '~' and the extension
    open(IN, $file) or print "Can't open $file\n";
    while (<IN>) {
        my $line = $_;
        chomp $line;
        my $severity = (split(",", $line))[6];    # seventh comma-separated field
        next if $severity =~ m/NORMAL/i;
        $hash{$time}{$severity}++;
    }
    close(IN);
}
foreach my $time (sort {$b <=> $a} keys %hash) {
    foreach my $severity ( keys %{$hash{$time}} ) {
        print $time . ',' . $severity . ',' . $hash{$time}{$severity} . "\n";
    }
}
Now I have written the same logic in Java, but it takes 2600 ms, i.e. 2.6 seconds, to complete. My question is: why is Java taking so much longer, and how can I achieve the same speed as Perl? Note: I excluded the JVM initialization and class-loading time.
import java.io.BufferedReader;
import java.io.File;
import java.io.FileFilter;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class MonitoringFileReader {
    static Map<String, Map<String, Integer>> store = new TreeMap<String, Map<String, Integer>>();
    static String opname;

    public static void testRead(String filepath) throws IOException {
        File file = new File(filepath);
        FileFilter fileFilter = new FileFilter() {
            @Override
            public boolean accept(File pathname) {
                // 86400000 ms per day, so this is the file age in days (matching find's -mtime -10)
                int timediffindays = (int) ((System.currentTimeMillis() - pathname.lastModified()) / 86400000);
                return timediffindays < 10
                        && pathname.getName().endsWith(".csv")
                        && pathname.getName().contains(opname);
            }
        };
        File[] listoffiles = file.listFiles(fileFilter);
        long time = System.currentTimeMillis();
        for (File mf : listoffiles) {
            // A~B~C~D~E~20150715080000.csv -> 20150715080000
            String timestamp = mf.getName().split("~")[5].replace(".csv", "");
            BufferedReader br = new BufferedReader(new FileReader(mf), 1024 * 500);
            String line;
            Map<String, Integer> tmp = store.containsKey(timestamp) ? store.get(timestamp) : new HashMap<String, Integer>();
            while ((line = br.readLine()) != null) {
                String severity = line.split(",")[6];
                // note: equals() is case-sensitive, unlike the Perl /NORMAL/i match
                if (!severity.equals("NORMAL")) {
                    tmp.put(severity, tmp.containsKey(severity) ? tmp.get(severity) + 1 : 1);
                }
            }
            br.close();
            store.put(timestamp, tmp);
        }
        time = System.currentTimeMillis() - time;
        System.out.println(time + "ms");
        System.out.println(store);
    }

    public static void main(String[] args) throws IOException {
        opname = args[0];
        long time = System.currentTimeMillis();
        testRead("./SMF/data/analyser/archive");
        time = System.currentTimeMillis() - time;
        System.out.println(time + "ms");
    }
}
File input format (filenames like A~B~C~D~E~20150715080000.csv), around 500 files of ~1 MB each; the severity is the seventh comma-separated field:
A,B,C,D,E,F,CRITICAL,G
A,B,C,D,E,F,NORMAL,G
A,B,C,D,E,F,INFO,G
A,B,C,D,E,F,MEDIUM,G
A,B,C,D,E,F,CRITICAL,G
Java Version: 1.7
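For reference, since only the seventh field is needed, it can be pulled out with a manual comma scan instead of split, which avoids allocating a String[] of all eight fields for every line. A minimal sketch, assuming fields never contain embedded or quoted commas; the helper name nthField is my own, not part of the code above:

// Returns the n-th (0-based) comma-separated field of the line, or null if absent.
// Sketch only: assumes no quoted/escaped commas inside fields.
static String nthField(String line, int n) {
    int start = 0;
    for (int i = 0; i < n; i++) {
        start = line.indexOf(',', start);
        if (start < 0) return null;   // fewer than n+1 fields
        start++;                      // step past the comma
    }
    int end = line.indexOf(',', start);
    return end < 0 ? line.substring(start) : line.substring(start, end);
}

Usage would be String severity = nthField(line, 6); which corresponds to the same index as line.split(",")[6].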
////////////////////Update///////////////////
As per the comments below, I replaced split with a precompiled regex, and performance improved a lot. Now I am running the same logic in a loop, and after 3-10 iterations the performance is quite acceptable.
import java.io.BufferedReader;
import java.io.File;
import java.io.FileFilter;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MonitoringFileReader {
    static Map<String, Map<String, Integer>> store = new HashMap<String, Map<String, Integer>>();
    static String opname = "Etis_Egypt";
    // The digits before the dot, i.e. the timestamp in A~B~C~D~E~20150715080000.csv
    static Pattern pattern1 = Pattern.compile("(\\d+)\\.");
    // One comma-terminated field per match, quoted or unquoted
    static Pattern pattern2 = Pattern.compile("(?:\"([^\"]*)\"|([^,]*))(?:[,])");
    static long currentsystime = System.currentTimeMillis();

    public static void testRead(String filepath) throws IOException {
        File file = new File(filepath);
        FileFilter fileFilter = new FileFilter() {
            @Override
            public boolean accept(File pathname) {
                // 86400000 ms per day, so this is the file age in days (matching find's -mtime -10)
                int timediffindays = (int) ((currentsystime - pathname.lastModified()) / 86400000);
                return timediffindays < 10
                        && pathname.getName().endsWith(".csv")
                        && pathname.getName().contains(opname);
            }
        };
        File[] listoffiles = file.listFiles(fileFilter);
        long time = System.currentTimeMillis();
        for (File mf : listoffiles) {
            Matcher matcher = pattern1.matcher(mf.getName());
            matcher.find();
            //String timestamp = mf.getName().split("~")[5].replace(".csv", "");
            String timestamp = matcher.group(1);    // group(1) excludes the trailing dot
            BufferedReader br = new BufferedReader(new FileReader(mf));
            String line;
            Map<String, Integer> tmp = store.containsKey(timestamp) ? store.get(timestamp) : new HashMap<String, Integer>();
            while ((line = br.readLine()) != null) {
                matcher = pattern2.matcher(line);
                for (int i = 0; i < 7; i++) {
                    matcher.find();    // advance to the seventh field
                }
                //String severity = line.split(",")[6];
                // use the capturing groups so the trailing comma is not included in the value
                String severity = matcher.group(1) != null ? matcher.group(1) : matcher.group(2);
                if (!severity.equals("NORMAL")) {
                    tmp.put(severity, tmp.containsKey(severity) ? tmp.get(severity) + 1 : 1);
                }
            }
            br.close();
            store.put(timestamp, tmp);
        }
        time = System.currentTimeMillis() - time;
        //System.out.println(time + "ms");
        //System.out.println(store);
    }

    public static void main(String[] args) throws IOException {
        //opname = args[0];
        for (int i = 0; i < 20; i++) {
            long time = System.currentTimeMillis();
            testRead("./SMF/data/analyser/archive");
            time = System.currentTimeMillis() - time;
            System.out.println("Time taken for " + i + " is " + time + "ms");
        }
    }
}
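A further small allocation saving might be possible here: instead of creating a new Matcher for every line, one Matcher can be created once and re-pointed at each line with reset(). A sketch under the assumption of single-threaded reading (Matcher is not thread-safe); variable names are mine:

// Create the matcher once, outside the read loop
Matcher fieldMatcher = pattern2.matcher("");
while ((line = br.readLine()) != null) {
    fieldMatcher.reset(line);    // reuse the same Matcher for each new line
    for (int i = 0; i < 7; i++) {
        fieldMatcher.find();     // advance to the seventh field
    }
    // ... extract severity from the capturing groups as before
}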
But I have another question now. See the result while running on a small dataset:
**Time taken for 0 is 218ms**
**Time taken for 1 is 134ms**
**Time taken for 2 is 127ms**
Time taken for 3 is 98ms
Time taken for 4 is 90ms
Time taken for 5 is 77ms
Time taken for 6 is 71ms
Time taken for 7 is 72ms
Time taken for 8 is 62ms
Time taken for 9 is 57ms
Time taken for 10 is 53ms
Time taken for 11 is 58ms
Time taken for 12 is 59ms
Time taken for 13 is 46ms
Time taken for 14 is 44ms
Time taken for 15 is 45ms
Time taken for 16 is 53ms
Time taken for 17 is 45ms
Time taken for 18 is 61ms
Time taken for 19 is 42ms
For the first few iterations the time taken is higher, and then it drops. Why?
Thanks,