8

I have been tasked with reading large CSV files (300k+ records) and applying regexp patterns to each record. I have always been a PHP developer and never really tried any other languages, but I decided to take the dive and attempt this in Java, which I assumed would be much faster.

In fact, just reading the CSV file line by line was 3x faster in Java. However, when I applied the regexp requirements, the Java implementation proved to take 10-20% longer than the PHP script.

It is quite possible that I did something wrong in Java, because I learned it as I went today. Below are the two scripts; any advice would be greatly appreciated. I would really like not to give up on Java for this particular project.

PHP CODE

<?php
$bgtime=time();
$patterns =array(
    "/SOME REGEXP/",
    "/SOME REGEXP/",                    
    "/SOME REGEXP/",    
    "/SOME REGEXP/" 
);   

$fh = fopen('largeCSV.txt','r');
while($currentLineString = fgetcsv($fh, 10000, ","))
{
    foreach($patterns AS $pattern)
    {
        preg_match_all($pattern, $currentLineString[6], $matches);
    }
}
fclose($fh);
print "Execution Time: ".(time()-$bgtime);

?>

JAVA CODE

import au.com.bytecode.opencsv.CSVReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.regex.Pattern;
import java.util.regex.Matcher;
import java.util.ArrayList;

public class testParser
{
    public static void main(String[] args)
    {
        long start = System.currentTimeMillis();


        String[] rawPatterns = {
                    "SOME REGEXP",
                    "SOME REGEXP",                    
                    "SOME REGEXP",    
                    "SOME REGEXP"    
        };

        ArrayList<Pattern> compiledPatternList = new ArrayList<Pattern>();        
        for(String patternString : rawPatterns)
        {
            Pattern compiledPattern = Pattern.compile(patternString);
            compiledPatternList.add(compiledPattern);
        }


        try{
            String fileName="largeCSV.txt";
            CSVReader reader = new CSVReader(new FileReader(fileName));

            String[] header = reader.readNext();
            String[] nextLine;
            String description;

            while( (nextLine = reader.readNext()) != null) 
            {
                description = nextLine[6];
                for(Pattern compiledPattern : compiledPatternList)
                {
                    Matcher m = compiledPattern.matcher(description);
                    while(m.find()) 
                    {
                        //System.out.println(m.group(0));
                    }                
                }
            }
        }

        catch(IOException ioe)
        {
            System.out.println("Blah!");
        }

        long end = System.currentTimeMillis();

        System.out.println("Execution time was "+((end-start)/1000)+" seconds.");
    }
}
Alan Moore
IOInterrupt
  • Not entirely related to your regex problem, but you might want to look at http://download.oracle.com/javase/6/docs/api/java/util/Scanner.html. You might find your CSVReader class is not needed. Not creating all of those temporary Strings (`nextLine` appears to have at least 7 Strings, but you only need one) might improve performance. – wolfcastle Jul 11 '11 at 21:13

5 Answers

4

Wrapping the FileReader in a BufferedReader might improve performance quite a bit:

CSVReader reader = new CSVReader(new BufferedReader(new FileReader(fileName)));
rsp
3

I don't see anything glaringly wrong with your code. Try isolating the performance bottleneck using a profiler. I find the NetBeans profiler very user-friendly.

EDIT: Why speculate? Profile the app and get a detailed report of where the time is spent. Then work to resolve the inefficient areas. See http://profiler.netbeans.org/ for more information.

EDIT2: OK, I got bored and profiled this. My code is identical to yours and parsed a CSV file with 1,000 identical lines as follows:

SOME REGEXP,SOME REGEXP,SOME REGEXP,SOME REGEXP,SOME REGEXP,SOME REGEXP,SOME REGEXP,SOME REGEXP,SOME REGEXP,SOME REGEXP

Here are the results (obviously your results will differ as my regular expressions are trivial). However, it's plain to see that the regex processing is not your main area of concern.

[image: profiler results]

Interestingly, if I apply a BufferedReader, the performance is enhanced by a whopping 18% (see below).

[image: profiler results with BufferedReader applied]

hoipolloi
  • Just whipped this in Notepad++, but I will give netbeans a go and see what it indicates. – IOInterrupt Jul 11 '11 at 21:13
  • Apparently I do not know how to utilize the profiler effectively. I've run the profiler against my JAVA application, but all it seems to show me is the Memory(Heap), Memory(GC), and Threads/Loaded Classes...along with the execution time of main(). Are there any good tutorials on how to utilize this? – IOInterrupt Jul 11 '11 at 22:00
  • @IOInterrupt - In Netbeans, Profile > Profile Main Project > CPU > Entire Application > Run – hoipolloi Jul 11 '11 at 22:26
  • Thank you! I found that I had to set the filter to "Profile all classes" for it to show the Live Results. I will be digging deeper into this for sure so I understand exactly what the issue is. Thank you again. – IOInterrupt Jul 12 '11 at 13:22
0

A few points to be noted here.

  1. You start measuring the time even before the patterns are compiled. Pattern.compile is a relatively expensive operation and may consume more time if the pattern is complex. Why not start measuring it after the compilation step?

  2. I'm not sure how efficient CSVReader class is.

  3. Rather than directly printing the matched results in the main thread itself, (as System.out.println is blocking and expensive) you could perhaps delegate printing to a different thread.
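Point 1 can be checked with a rough sketch like the one below, which times pattern compilation and matching separately. The class name `SplitTiming`, its helper methods, the placeholder patterns, and the sample lines are all made up for illustration; the real program would read lines from the CSV instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original program.
class SplitTiming {
    // Compile every raw pattern once, up front.
    static List<Pattern> compileAll(String[] rawPatterns) {
        List<Pattern> compiled = new ArrayList<>();
        for (String raw : rawPatterns) {
            compiled.add(Pattern.compile(raw));
        }
        return compiled;
    }

    // Count all matches of all patterns in a single input string.
    static int countMatches(List<Pattern> patterns, String input) {
        int count = 0;
        for (Pattern p : patterns) {
            Matcher m = p.matcher(input);
            while (m.find()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Placeholder data; assumptions for illustration only.
        String[] rawPatterns = {"foo", "ba."};
        String[] lines = {"foo bar baz", "no match here", "foofoo"};

        long t0 = System.nanoTime();
        List<Pattern> patterns = compileAll(rawPatterns);
        long t1 = System.nanoTime();

        int total = 0;
        for (String line : lines) {
            total += countMatches(patterns, line);
        }
        long t2 = System.nanoTime();

        System.out.println("Compile time: " + (t1 - t0) / 1_000_000.0 + " ms");
        System.out.println("Match time:   " + (t2 - t1) / 1_000_000.0 + " ms");
        System.out.println("Total matches: " + total);
    }
}
```

If the compile time turns out to be negligible next to the match time, the position of the start timestamp is unlikely to explain the gap with PHP.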

adarshr
  • 1
    I prefer to time the script from inception, because both scripts are essentially doing the same thing and I think total script execution time is a valid indicator. I thought that the complexity of the regular expression was the issue so I changed them all to be a single common word. PHP execution time was 93 seconds vs Java's 246 seconds. I believe the CSVReader class is efficient, because it was able to read the CSV file much faster (3x faster) than PHP fgetcsv() function. In addition, I have commented out the println() function. – IOInterrupt Jul 11 '11 at 21:14
  • @IOInterrupt: Right. There are a number of factors that can come into play here. The amount of memory allocated to the VM also plays a major role. You could try profiling the application as hoipolloi suggested. – adarshr Jul 11 '11 at 21:17
  • I am quite ignorant when it comes to the VM. I was just happy to get this thing working. – IOInterrupt Jul 11 '11 at 21:28
0

Several things:

  1. The regex has to be compiled only once, and that should happen at server startup, so it doesn't really matter for performance while it's running.

  2. And most importantly, you're writing a completely invalid benchmark for a long-running Java program. You're most certainly loading several classes while benchmarking and overall only testing the interpreter's performance and NOT the JIT, which will obviously result in much worse performance. See this excellent post for how to write a valid benchmark in Java. Most certainly this will remedy all alleged performance problems in this case.
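The warm-up idea in point 2 might look roughly like the sketch below: run the matching code a number of times before starting the clock, so that class loading and JIT compilation are mostly out of the way. The class name `WarmupBenchmark`, the pattern, the generated lines, and the warm-up count of 20 are arbitrary assumptions; a dedicated harness such as JMH handles these concerns far more rigorously.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class for illustration; not from the original post.
class WarmupBenchmark {
    // Count matches of one pattern across all lines.
    static int countMatches(Pattern p, String[] lines) {
        int count = 0;
        for (String line : lines) {
            Matcher m = p.matcher(line);
            while (m.find()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile("word");

        // Synthetic input standing in for the CSV column (assumption).
        String[] lines = new String[10_000];
        for (int i = 0; i < lines.length; i++) {
            lines[i] = "some word in a line with another word " + i;
        }

        // Warm-up passes: results are accumulated so the work is not optimized away.
        int sink = 0;
        for (int i = 0; i < 20; i++) {
            sink += countMatches(p, lines);
        }

        // Timed pass, after warm-up.
        long start = System.nanoTime();
        int total = countMatches(p, lines);
        long elapsedNanos = System.nanoTime() - start;

        System.out.println("Warm-up matches (ignored): " + sink);
        System.out.println("Timed pass: " + total + " matches in "
                + elapsedNanos / 1_000_000.0 + " ms");
    }
}
```

For a one-off run that takes minutes, warm-up is probably a small share of the total, so this matters most when comparing short runs.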

Community
Voo
  • 2
    The OP didn't say there was a server, nor that it was a long running program. It might be the case, and you would then be right, but it might not. – JB Nizet Jul 11 '11 at 21:18
  • 1
    Assumed it was a server because he was using PHP, but yes you're right. But if the program isn't long running and not performance critical why the hell would one care about optimizing it? – Voo Jul 11 '11 at 21:19
  • The PHP script is run manually via the PHP CLI, just as the Java app is run manually. I am not sure if this makes a difference in what you are suggesting. Both scripts run for about 10 minutes when processing a 500 MB CSV file. – IOInterrupt Jul 11 '11 at 21:25
0

I would recommend:

  • as somebody else has suggested, profile to see where the actual bottleneck is;
  • tell us what the actual regexes are: it may be that you're using some specific subpattern that isn't very efficient in Java's implementation.

It's quite possible that parts of PHP's regex engine are more optimised than Java's for specific expression types, and/or there's a way to optimise the actual expression that you're using.
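Short of a full profiler run, one rough way to spot a slow expression is to time each pattern on its own over the same input. The sketch below assumes nothing about the real regexes: the class name `PerPatternTiming`, the three patterns, and the sample lines are invented placeholders.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class for illustration; not from the original post.
class PerPatternTiming {
    // Count matches of one compiled pattern across all lines.
    static int countMatches(Pattern p, String[] lines) {
        int count = 0;
        for (String line : lines) {
            Matcher m = p.matcher(line);
            while (m.find()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Placeholder patterns and input (assumptions).
        String[] rawPatterns = {"\\d+", "[a-z]+@[a-z]+", "(a|b)+c"};
        String[] lines = {
            "order 42 shipped to bob@example on day 7",
            "aababc and abc",
            "nothing to see"
        };

        // Time each pattern separately so a single slow one stands out.
        for (String raw : rawPatterns) {
            Pattern p = Pattern.compile(raw);
            long start = System.nanoTime();
            int matches = countMatches(p, lines);
            long elapsedMicros = (System.nanoTime() - start) / 1_000;
            System.out.println(raw + ": " + matches + " matches in "
                    + elapsedMicros + " us");
        }
    }
}
```

On a real file, the per-pattern numbers only become meaningful over many lines, and a pattern that dominates the total is the one worth posting or rewriting.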

Neil Coffey