
I'm reading some very large text files and running into the error described in Java - Char Buffer Issue.

When the file is very large (>1 GB), `Charset.defaultCharset().decode(bb).toString()` (where `bb` is the mapped `ByteBuffer`) throws an IllegalArgumentException, presumably because the capacity of the char buffer being allocated overflows and becomes negative.
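
Roughly the kind of overflow I mean (the arithmetic below is just an illustration of int wrap-around, not the actual decoder internals):

int byteCount = 1_500_000_000;      // ~1.5 GB of input bytes
int estimate = byteCount * 2 + 1;   // wraps past Integer.MAX_VALUE to a negative value
// CharBuffer.allocate(estimate) would then throw IllegalArgumentException,
// because a negative capacity is rejected.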

Here's the slurp function I've been using:

public static String slurp(File f) throws IOException, FileNotFoundException {
    FileInputStream fis = new FileInputStream(f);
    try {
        FileChannel fc = fis.getChannel();
        MappedByteBuffer bb = fc.map(FileChannel.MapMode.READ_ONLY, 0, fc.size());

        // The decode call on the following line throws the IllegalArgumentException
        return Charset.defaultCharset().decode(bb).toString();
    } finally {
        fis.close();
    }
}

I'd like to add error handling to this function so that, when the exception is thrown, it falls back to an alternative, safer method, such as the pattern from the question body of How do I create a Java string from the contents of a file?

For example,

public static String slurp(File f) throws IOException, FileNotFoundException {
    FileInputStream fis = new FileInputStream(f);
    try {
        FileChannel fc = fis.getChannel();
        MappedByteBuffer bb = fc.map(FileChannel.MapMode.READ_ONLY, 0, fc.size());
        return Charset.defaultCharset().decode(bb).toString();
    } catch (IllegalArgumentException e) {
        // This exception is thrown for extremely large files
        BufferedReader reader = new BufferedReader(new FileReader(f));
        try {
            String line;
            StringBuilder stringBuilder = new StringBuilder();
            String ls = System.getProperty("line.separator");

            while ((line = reader.readLine()) != null) {
                stringBuilder.append(line);
                stringBuilder.append(ls);
            }

            return stringBuilder.toString();
        } finally {
            reader.close();
        }
    } finally {
        fis.close();
    }
}

An alternative would be to use the most memory-efficient answer proposed on the same question.

public static String slurp(File f) throws IOException, FileNotFoundException {
    FileInputStream fis = new FileInputStream(f);
    try {
        FileChannel fc = fis.getChannel();
        MappedByteBuffer bb = fc.map(FileChannel.MapMode.READ_ONLY, 0, fc.size());
        return Charset.defaultCharset().decode(bb).toString();
    } catch (IllegalArgumentException e) {
        // This exception is thrown for extremely large files
        List<String> lines = Files.readAllLines(f.toPath(), Charset.defaultCharset());
        return String.join("\n", lines);
    } finally {
        fis.close();
    }
}

Any large file is going to be cumbersome in memory when slurped rather than streamed, but is there any reason to prefer one of these two methods, or something else altogether?

I ask because the accepted answer to that question discusses the memory utilization of the two answers' solutions, but not the example pattern from the question itself.
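
For reference, a streaming, line-at-a-time version would look roughly like this (processLine is just a placeholder for whatever per-line handling applies):

try (BufferedReader reader = Files.newBufferedReader(f.toPath(), Charset.defaultCharset())) {
    String line;
    while ((line = reader.readLine()) != null) {
        processLine(line); // placeholder: handle each line instead of accumulating one huge String
    }
}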

    Can you show us a sample of the text file contents? If it's structured data of some kind, I'd use a different strategy. Only read the file a few bytes at a time, and process that data, then read more. Even if your char buffer doesn't kill you, appending the whole gig+ file to a String is still going to have huge memory penalties. There's also an assumption in your code that there are enough end of line characters to break up the file so that `readLine()` knows where to break. – Lucien Stals Dec 09 '15 at 01:38
  • @Lucien It's a json file. I've streamed json before with gson, but I was hoping for a less intrusive fix. – Cecilia Dec 09 '15 at 01:42
  • I think the problem is that you are trying to read the whole file before parsing it. Obviously with a big file, that's memory intensive. Some of the lower level classes (like the `FileInputStream` itself) provide ways to read only a few bytes at a time (like the `public int read(byte[] b, int off, int len)` method). What I don't know is how that helps you parse JSON without reading the whole file first. Something like a massive CSV file is easy enough if you process it one line at a time, but I imagine you need to see most of the JSON before you know how to handle it. – Lucien Stals Dec 09 '15 at 01:50
  • A little Googling came up with this library... https://github.com/FasterXML/jackson I can't say if it's any good (I've never used it), but it does claim to do "low-level streaming" – Lucien Stals Dec 09 '15 at 01:51
  • You can read millions of lines a second with `BufferedReader.readLine()`, and without leaving behind expensive memory mappings of files either. And there are few general computing problems that require reading entire files into memory. Process it a line at a time. – user207421 Dec 09 '15 at 02:38
