
I would like to find the number of instances of the "$$$$" pattern in a text file. The following method works with some files, but not with all of them. For example, it does not work with this file (http://www.hmdb.ca/downloads/structures.zip - it is a zipped text file with the .sdf extension), and I can't figure out why. I also tried escaping whitespace, with no luck. It returns 11 when there are more than 35000 "$$$$" patterns. Please note that speed is crucial, so I can't use any slower methods.

public static void countMoleculesInSDF(String fileName)
{
    int tot = 0;
    Scanner scan = null;
    Pattern pat = Pattern.compile("\\$\\$\\$\\$");

    try {
        File file = new File(fileName);
        scan = new Scanner(file);
        long start = System.nanoTime();
        while (scan.findWithinHorizon(pat, 0) != null) {
            tot++;
        }
        long dur = (System.nanoTime() - start) / 1000000;
        System.out.println("Results found: " + tot + " in " + dur + " msecs");
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        if (scan != null) {
            scan.close();
        }
    }
}

2 Answers


For the linked file and your code as you posted it, I consistently got a total of 218 matches. This, of course, is not correct: verified with Notepad++'s count function, the file should contain 41498 matches. So there must be something wrong with the Scanner (I thought), and I started debugging inside it at the point where the last match was found, i.e. when the Scanner reported that there were no more matches left. Doing so, I came across an exception in its private method readInput() which is not thrown directly, but instead saved in a local variable:

try {
    n = source.read(buf);
} catch (IOException ioe) {
    lastException = ioe;
    n = -1;
}

This exception can be retrieved using the method Scanner#ioException():

IOException ioException = scanner.ioException();
if (ioException != null) {
    ioException.printStackTrace();
}

Printing this exception then showed that some input could not be decoded:

java.nio.charset.UnmappableCharacterException: Input length = 1
    at java.nio.charset.CoderResult.throwException(CoderResult.java:278)
    at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:338)
    at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:177)
    at java.io.Reader.read(Reader.java:100)
    at java.util.Scanner.readInput(Scanner.java:849)

So I just tried passing a charset to the Scanner's constructor:

scan = new Scanner(file, "utf-8");

And it made it work!

Results found: 41498 in 2431 msecs

So the problem was that the Scanner used the platform's default charset, which was not suitable to completely decode the file you have.

Moral of the story:

  1. Always explicitly pass a charset when working with text.
  2. Check for IOException when working with Scanner; a sketch combining both points follows below.
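
Putting the two together, a minimal sketch might look like this (the method name countMatches, the int return value, and the try-with-resources form are my additions, not part of the original code):

import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.Scanner;
import java.util.regex.Pattern;

public static int countMatches(String fileName) throws FileNotFoundException {
    int total = 0;
    Pattern pat = Pattern.compile(Pattern.quote("$$$$"));
    // pass the charset explicitly instead of relying on the platform default
    try (Scanner scan = new Scanner(new File(fileName), "utf-8")) {
        while (scan.findWithinHorizon(pat, 0) != null) {
            total++;
        }
        // Scanner swallows decoding errors; surface them explicitly
        IOException ioe = scan.ioException();
        if (ioe != null) {
            ioe.printStackTrace();
        }
    }
    return total;
}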

PS: Some handy ways to quote a string for use as a regex:

Pattern pat = Pattern.compile("\\Q$$$$\\E");

or

Pattern pat = Pattern.compile(Pattern.quote("$$$$"));

Here's what I ended up doing (before you posted your answer). This method seems to be faster than Scanner. Which implementation would you suggest, Scanner or memory mapping? Will memory mapping fail for large files? Not sure...

private static final Charset CHARSET = Charset.forName("ISO-8859-15");
private static final CharsetDecoder DECODER = CHARSET.newDecoder();

public static int getNoOfMoleculesInSDF(String fileName)
{
    int total = 0;
    Pattern endOfMoleculePattern = Pattern.compile("\\$\\$\\$\\$");
    try (FileInputStream fis = new FileInputStream(fileName);
         FileChannel fc = fis.getChannel()) {
        // note: the int cast limits this to files smaller than 2 GB
        int fileSize = (int) fc.size();
        MappedByteBuffer mbb = fc.map(FileChannel.MapMode.READ_ONLY, 0, fileSize);
        // decodes the whole file into one CharBuffer (see the comments below)
        CharBuffer cb = DECODER.decode(mbb);
        Matcher matcher = endOfMoleculePattern.matcher(cb);
        while (matcher.find()) {
            total++;
        }
    } catch (Exception e) {
        LOGGER.error("An error occurred while counting molecules in the SD file", e);
    }
    return total;
}
  • This method looks good too, but unfortunately it doesn't work out for large files, such as the one you have linked (~ 250MB). It crashes with an `OutOfMemoryError: Java heap space` because `DECODER.decode(mbb)` tries to allocate a char buffer as big as the file itself, which even increasing the JVM heap space with the `-Xmx` option would not avoid. What I had tried before was using a buffered reader and applying the pattern to each line (a sketch of this follows after the comments); it worked fine but took 4 times longer than Scanner. I think the Scanner method is the best choice to avoid OOMEs at runtime. Scanner's buffer is only 1024! – A4L Oct 08 '13 at 21:40
  • Please see the answer to this [question](http://stackoverflow.com/questions/7298455/huge-arrays-throws-out-of-memory-despite-enough-memory-available) for why an OOME can still happen despite setting `-Xmx` – A4L Oct 08 '13 at 21:47
  • This method did work with -Xms2000m. It was much faster: 600ms compared to 1900ms for the same file. However, limited memory can become a problem. I am going to go with Scanner... – lochi Oct 08 '13 at 22:18
  • Wow, that's a lot of memory! I went up only to -Xmx1G with no luck; after that my system even failed to allocate memory for the JVM itself. Indeed, you cannot rely on always having that much memory available for your application, and if the file is even larger you'll end up needing even more! You could still speed up Scanner by increasing its buffer size; unfortunately that member is `final` and `private`, but [with reflection almost everything is possible](http://stackoverflow.com/questions/3301635/change-private-static-final-field-using-java-reflection) ;-) – A4L Oct 09 '13 at 07:51
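
For reference, a minimal sketch of the line-based approach mentioned in the first comment (the method name, the UTF-8 charset, and the use of java.nio.file are assumptions; it relies on the SDF convention that the "$$$$" delimiter starts its own line):

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public static int countDelimitersLineByLine(String fileName) throws IOException {
    int total = 0;
    // an explicit charset makes decoding failures throw instead of being
    // silently swallowed; reading line by line keeps memory usage constant
    try (BufferedReader reader = Files.newBufferedReader(
            Paths.get(fileName), StandardCharsets.UTF_8)) {
        String line;
        while ((line = reader.readLine()) != null) {
            // in SDF files the record delimiter "$$$$" sits on its own line
            if (line.startsWith("$$$$")) {
                total++;
            }
        }
    }
    return total;
}

This avoids the OutOfMemoryError of the mapped-buffer approach, at the cost of the slower line-by-line scan noted in the comments above.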