I have a CSV file with 12000 rows. Each row has several fields enclosed in double quotes and separated by commas. One of these fields is an XML document, so a row can be very long. The file size is 174 MB.

Here is an example of the file:

"100000","field1","field30","<root><data>Hello I have a
line break</data></root>","field31"
"100001","field1","field30","<root><data>Hello I have multiple
line 
break</data></root>","field31"

The problem is inside the XML field, which can contain one or more line breaks and thus break the parsing. The goal here is to read the whole file and apply a regex that replaces all line breaks inside double quotes with an empty string.

The following code gives me an OutOfMemoryError:

    String path = "path/to/file.csv";

    try {
        byte[] content = Files.readAllBytes(Paths.get(path));
    }
    catch (Exception e) {
        e.printStackTrace();
        System.exit(1);
    }

I've also tried to read the file using a BufferedReader and a StringBuilder, and got an OutOfMemoryError around line 5000:

    String path = "path/to/file.csv";

    try {
        StringBuilder sb = new StringBuilder();
        BufferedReader br = new BufferedReader(new FileReader(path));
        String line;
        int count = 0;
        while ((line = br.readLine()) != null) {
            sb.append(line);
            System.out.println("Read " + count++);
        }
    }
    catch (Exception e) {
        e.printStackTrace();
        System.exit(1);
    }

I've tried to run both of the programs above with different Java heap sizes, like -Xmx1024m, -Xmx4096m and -Xmx8092m. In all cases I got an OutOfMemoryError. Why is this happening, considering that the file size is 174 MB?

revy
  • Are you sure you set the correct -Xmx argument? Have you tried monitoring your heap space? – Pavel Smirnov Mar 27 '19 at 11:26
  • Possible duplicate of [out of memory error, java heap space](https://stackoverflow.com/questions/20626652/out-of-memory-error-java-heap-space) – Mebin Joe Mar 27 '19 at 11:28
  • A memory mapped ByteBuffer springs to mind. But what regex operation do you want to achieve? BTW `new StringBuilder(99999)` and `sb.append(line).append('\n');` – Joop Eggen Mar 27 '19 at 11:32
  • Are you using a 64-bit JRE? – Joop Eggen Mar 27 '19 at 11:35
  • Look out for the garbage collector... – aran Mar 27 '19 at 11:37
  • If each row contains an XML document, maybe you should use a streaming API or `XMLPath`? What kind of processing needs all the `XML` documents if they are independent? Maybe you should optimize the algorithm? – Michał Ziober Mar 27 '19 at 11:40
  • @PavelSmirnov Yes I am. – revy Mar 27 '19 at 11:42
  • @JoopEggen I've tried with StringBuilder(99999) and sb.append(line).append('\n') and got OutOfMemoryError at line 6888. Yes, I am using a 64-bit JRE. – revy Mar 27 '19 at 11:42
  • *Why* do you think you have to apply a single regex over the whole file in memory? What's stopping you doing it line by line? – user207421 Mar 27 '19 at 11:44
  • @user207421 Basically this XML field is enclosed in double quotes and can have line breaks that break the parsing. I need to remove the line breaks that are inside double quotes. I've updated the question. – revy Mar 27 '19 at 11:55
  • @revy, as far as I can see, there's nothing criminal in your code. A 174 MB file is not that big if you have really set -Xmx1024m or above. You should analyze your heap space with one of the monitoring tools, VisualVM for example. Without it, it's really difficult to tell more. – Pavel Smirnov Mar 27 '19 at 11:57

3 Answers


You need two layers of buffering to parse your special data structure, and you should process the file record by record. Reading the whole document into memory is not the best idea.

Create your own reader that uses an inner BufferedReader to read lines from your CSV file. After reading a line, determine whether you need to read more lines to finish one CSV record (e.g. if you know that your XML starts with <root> and ends with </root>, check for the presence of these strings, and read and append until you reach the closing token; that will be the last line of your CSV record).

The second layer is your CSV processing, based on the CSV record you get from the first step. Parse it, process it, then throw it away. It will then not consume memory any longer; the Java garbage collector will free it up.

This is the only way to deal with large files. It is also called the "streaming model", because you pass only small chunks of data through, so the actual memory consumption stays low.
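
A minimal sketch of that first layer, assuming (from the example data in the question) that the XML field of every record ends with </root>; the processCsvLine method is a hypothetical placeholder for the second layer:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;

    public class CsvRecordStreamer {

        public static void main(String[] args) throws IOException {
            try (BufferedReader br = new BufferedReader(new FileReader("path/to/file.csv"))) {
                StringBuilder record = new StringBuilder();
                String line;
                while ((line = br.readLine()) != null) {
                    record.append(line);
                    // the record is complete once the XML field is closed;
                    // "</root>" as the closing token is assumed from the example
                    if (line.contains("</root>")) {
                        processCsvLine(record.toString());
                        record.setLength(0); // reuse the buffer so memory stays flat
                    }
                }
            }
        }

        private static void processCsvLine(String csvLine) {
            // second layer: parse the CSV record, process it, then let it go
        }
    }

Only one record is held in memory at a time, so heap usage no longer depends on the file size.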

gaborsch
  • Yes, I know how to solve it line by line. The thing is that I was very surprised that a 174 MB file cannot be kept in memory, even with a heap size of 12 GB... – revy Mar 27 '19 at 12:08
  • The in-memory size is doubled, because Java stores characters as 2-byte UTF-16 chars. What is possible is that if you want to allocate a big contiguous block, the JVM will not find a suitable one, because the memory is fragmented. But from your second example I still cannot see how that would be possible. So, the question is good; try to get GC information, memory load, etc. – gaborsch Mar 27 '19 at 12:23
  • @gaborsch he tried allocating 4 GB and 8 GB respectively, there is enough contiguous free memory. – Karol Dowbecki Mar 27 '19 at 12:38
  • Try to run `jstat` while reading your file (possibly with `sleep()`s, to be able to follow the process). That will provide some insight into what is actually happening in the JVM and which memory segment is running out of space. – gaborsch Mar 27 '19 at 13:04

Wrap your InputStream with a filtering one:

import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

class QuotedNewLineFilterInputStream extends FilterInputStream {

    private boolean insideQuotes;

    public QuotedNewLineFilterInputStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int c = super.read();
        if (c == '\"') {
            insideQuotes = !insideQuotes; // toggle at every double quote
        }
        if (insideQuotes && (c == '\n' || c == '\r')) {
            c = read(); // swallow line breaks inside quotes by reading on
        }
        return c;
    }

    // FilterInputStream delegates bulk reads straight to the wrapped stream,
    // bypassing read() above, so route them through the single-byte filter.
    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        int i = 0;
        int c;
        while (i < len && (c = read()) != -1) {
            b[off + i++] = (byte) c;
        }
        return i == 0 ? -1 : i;
    }
}

This removes LF and CR inside double quotes. Since the quote, LF and CR characters are all ASCII and the XML is probably in UTF-8, one can work at the byte level (InputStream).

By the way, a replacement with a tab might better preserve the layout (c = '\t' instead of c = read()).

Not very intelligent, but a simple solution.
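
For example, it could feed a BufferedReader like this (a minimal usage sketch, assuming the file is UTF-8):

    import java.io.BufferedReader;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;

    public class FilterDemo {
        public static void main(String[] args) throws IOException {
            try (BufferedReader br = new BufferedReader(new InputStreamReader(
                    new QuotedNewLineFilterInputStream(new FileInputStream("path/to/file.csv")),
                    StandardCharsets.UTF_8))) {
                String line;
                while ((line = br.readLine()) != null) {
                    // each line is now one complete CSV record, because quoted
                    // line breaks are removed before the reader ever sees them
                }
            }
        }
    }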

Joop Eggen
  • Simple, _and intelligent_, because it doesn't try to eliminate line breaks while using an input mechanism that uses those line breaks for its own purposes. – kdgregory Mar 27 '19 at 14:14

If reading a 174 MB file with Files.readAllBytes(Paths.get(path)) causes an OutOfMemoryError, then you failed to increase the memory limit with -Xmx8g. With an 8 GB heap there should be no problem allocating 174 MB of contiguous memory for a byte[].

Double-check how you passed the -Xmx flag. You can verify the JVM runtime options by connecting to a running JVM process with JConsole, JVisualVM or another tool. Take a look at Using JConsole, which shows how to check JVM runtime options, e.g. on the Memory tab.
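
As a quick in-process sanity check, you can also print the heap limit the JVM actually received (a minimal sketch; Runtime.maxMemory() reports the effective -Xmx value):

    public class HeapCheck {
        public static void main(String[] args) {
            // maxMemory() returns the maximum amount of heap the JVM
            // will attempt to use, i.e. the effective -Xmx setting
            long maxBytes = Runtime.getRuntime().maxMemory();
            System.out.println("Max heap: " + (maxBytes / (1024 * 1024)) + " MB");
        }
    }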

Karol Dowbecki