I have been trying to read a 2 GB XML file. I have tried different methods, but each of them throws an OutOfMemoryError. I even tried increasing the maximum heap size to 4 GB and the minimum to 2 GB in Eclipse, but the problem persists. How can I resolve this? I don't want to use any third-party library.

Following is the code that I have tried so far:

String str = new String(Files.readAllBytes(Paths.get(pathname)),
                    StandardCharsets.UTF_8);

and

try (Scanner scanner = new Scanner(new File(pathname))) {
    while (scanner.hasNextLine()) {
        String line = scanner.nextLine();
    }
}
Zeeshan Shabbir
  • Why are you reading the XML file? What do you want to do with the information? This is critical to finding a solution to the problem. Reading the whole file into a single Java string doesn't seem a good way of starting, regardless of what you want to do with it next. – Michael Kay Sep 23 '18 at 16:57
  • @MichaelKay I agree with you. I had this task as part of my data science project. It was an initial requirement from my mentor to store all the data in one string. I then had to perform a specific task on that string. I ended up using a SAX parser. – Zeeshan Shabbir Sep 26 '18 at 05:07

1 Answer

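In Java 8 and earlier, each character of a String uses 2 bytes, so a 2 GB file of mostly ASCII text needs about 4 GB as a String, on top of the 2 GB byte array it is decoded from, and you also need memory for processing. I would give it a lot more memory, like 24 GB, and see how much it really needs.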
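
As a rough illustration of that arithmetic, here is a minimal sketch. The class and method names are hypothetical, and it assumes the file is mostly single-byte UTF-8 text, so one file byte becomes one character. It only gives a lower bound, since it ignores temporary buffers and everything else on the heap.

```java
class StringMemoryEstimate {

    // Lower-bound heap estimate for
    //   new String(Files.readAllBytes(path), StandardCharsets.UTF_8)
    // Assumption: one file byte decodes to one character (mostly ASCII content).
    static long estimateHeapBytes(long fileSizeBytes, int bytesPerChar) {
        long rawBytes = fileSizeBytes;                    // the byte[] returned by readAllBytes
        long stringBytes = fileSizeBytes * bytesPerChar;  // the String's backing array
        return rawBytes + stringBytes;                    // both are alive while decoding
    }

    public static void main(String[] args) {
        long fileSize = 2L * 1024 * 1024 * 1024;  // 2 GiB
        long gib = 1024L * 1024 * 1024;

        // 2 bytes per char: Java 8 and earlier, or any non-Latin-1 text on Java 9+
        System.out.println("2 bytes/char: " + estimateHeapBytes(fileSize, 2) / gib + " GiB");
        // 1 byte per char: Latin-1-only text on Java 9+
        System.out.println("1 byte/char:  " + estimateHeapBytes(fileSize, 1) / gib + " GiB");
    }
}
```

Even with enough heap, Java arrays cannot hold more than about 2^31-1 elements, so a file of a full 2 GiB cannot be read into a single byte[] at all.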

Note: Java 9+ has compact strings, which store Latin-1-only text in one byte per character and can reduce memory consumption.

A better approach is to use a SAX parser to process the file as you read it, which will use a tiny fraction of the memory.
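
A minimal sketch of that streaming approach, using only the SAX API that ships with the JDK (so no third-party library). The class name, the handler, and what it counts are assumptions for illustration; a real handler would do whatever processing the task needs inside the callbacks.

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class SaxStreamingExample {

    // Hypothetical handler: counts elements and text characters as they stream past.
    // Nothing is retained between callbacks, so memory use stays small.
    static class CountingHandler extends DefaultHandler {
        long elements = 0;
        long textChars = 0;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attributes) {
            elements++;
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // May be called several times for one text node; process each chunk here.
            textChars += length;
        }
    }

    // Parses the stream once, front to back, invoking the handler callbacks.
    static CountingHandler parse(InputStream in) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        CountingHandler handler = new CountingHandler();
        parser.parse(in, handler);
        return handler;
    }

    public static void main(String[] args) throws Exception {
        // args[0] is the path to the XML file
        try (InputStream in = new BufferedInputStream(new FileInputStream(args[0]))) {
            CountingHandler result = parse(in);
            System.out.println("elements: " + result.elements);
            System.out.println("text characters: " + result.textChars);
        }
    }
}
```

The parser only hands the handler one small piece of the document at a time, so the heap needed does not grow with the file size unless the handler itself accumulates data.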

Peter Lawrey
  • My machine has 8 GB. I tried 8 as well, and nothing changed. I assume there is no other way, then; I'd have to use a SAX parser. Thank you for the answer. – Zeeshan Shabbir Sep 23 '18 at 10:59
  • @PeterLawrey what is a bit painful is that even if there is a single letter outside `Latin1`, the entire String uses 2 bytes per char – Eugene Sep 23 '18 at 11:00
  • Oh wait, that means I can't have the whole content of the XML in one string. – Zeeshan Shabbir Sep 23 '18 at 11:05
  • The question is whether you really need to have the whole 2 GB in memory; maybe you can read and process it directly – vmrvictor Sep 23 '18 at 11:46
  • @Eugene Oracle/OpenJDK Java 9+'s Compressed Strings can use a byte[] instead of a char[]. BTW Java 6 used to have the feature but it was dropped in Java 7. – Peter Lawrey Sep 23 '18 at 17:03
  • @ZeeshanShabbir BTW A String is limited to 2 billion characters (2^31-1) – Peter Lawrey Sep 23 '18 at 17:03
  • BTW 16 GB is around $100 so might be worth the investment. – Peter Lawrey Sep 23 '18 at 17:04
  • @PeterLawrey not can, but will, always. It's just a matter of how big that byte array is. My point was that because a certain letter is outside Latin1, meaning that that single letter, for example, needs to be encoded into 2 bytes, it spoils the party for everyone else, even if everyone else needs just one byte. – Eugene Sep 23 '18 at 17:06
  • @PeterLawrey also [this is related](https://stackoverflow.com/questions/44178432/difference-between-compact-strings-and-compressed-strings-in-java-9) – Eugene Sep 23 '18 at 17:10
  • @Eugene Thank you for that reference. +1 – Peter Lawrey Sep 23 '18 at 17:13
  • @Eugene and that single non-Latin character does not only cause the string to consume twice the memory; since it is a byte array in Java 9, needing two bytes per character reduces the maximum string size to half of the maximum array size, i.e. less than 2³⁰ characters. – Holger Sep 24 '18 at 07:14