
I have several XML files (gigabytes in size) that need to be converted to JSON. I can easily convert small files (kilobytes in size) using the org.json library (https://mvnrepository.com/artifact/org.json/json/20180813).

Here's the code that I am using:

    String line = "", str = "";
    BufferedReader br = new BufferedReader(new FileReader(link));
    FileWriter fw = new FileWriter(outputlink);
    JSONObject jsondata = null;

    // accumulate the whole file into one String, then convert it in one go
    while ((line = br.readLine()) != null) {
        str += line;
    }
    jsondata = XML.toJSONObject(str);

But even files under 100 MB take too long to process, and the larger ones throw java.lang.OutOfMemoryError: Java heap space. So, how can I optimize the code to process large files (or what other approach/library would work)?

UPDATE

I have updated the code so that the XML is converted to JSON and written out segment by segment.

My XML:

<PubmedArticleSet>
     <PubmedArticle>
     </PubmedArticle>
     <PubmedArticle>
     </PubmedArticle>
...
</PubmedArticleSet>

So I skip the root node <PubmedArticleSet> (I will add it back later), convert each <PubmedArticle> … </PubmedArticle> element to JSON on its own, and write it out one at a time:

    br = new BufferedReader(new FileReader(link));
    fw = new FileWriter(outputlink, true);
    StringBuilder str = new StringBuilder();
    br.readLine(); // skip the first three lines and the root node
    br.readLine();
    br.readLine();

    while ((line = br.readLine()) != null) {
        JSONObject jsondata = null;
        str.append(line);
        System.out.println(str);

        if (line.trim().equals("</PubmedArticle>")) { // split here
            jsondata = XML.toJSONObject(str.toString());
            String jsonPrettyPrintString = jsondata.toString(PRETTY_PRINT_INDENT_FACTOR);
            fw.append(jsonPrettyPrintString);

            System.out.println("One done"); // one section done
            str = new StringBuilder();      // reset the buffer for the next article
        }
    }
    fw.close();

I no longer get the heap error, but processing still takes hours for files in the ~300 MB range. Kindly provide any suggestions to speed up this process.

RohanJ
  • That's where a JSON encoder/decoder written in pure C could come to the rescue: [Parsing XML in Pure C](https://stackoverflow.com/questions/4846568/parsing-xml-in-pure-c) & [Parsing JSON using C](https://stackoverflow.com/questions/6673936/parsing-json-using-c). You could try to port them to Java with JNI. – KaiserKatze Aug 22 '18 at 03:43
  • There are several Java libraries that can handle this conversion for you (and in a much more efficient way). See this answer for some examples: https://stackoverflow.com/a/39493394/1420773 – ninge Aug 22 '18 at 03:56
  • Is the XML data structured in a way that you can divide it into chunks and serialise them individually? If so, you could write each chunk out after reading it, rather than loading them all. – teppic Aug 22 '18 at 04:03
  • When you say "the JSON library" you need to say which one. There are dozens. – Michael Kay Aug 22 '18 at 09:04
  • @MichaelKay By "the JSON library" I mean the org.json library; I have updated the question. Kindly suggest a better alternative library, if there is one, to achieve the task. – RohanJ Aug 22 '18 at 11:59
  • With the code in your update the string will only grow; you need to reset it after each `PubmedArticle`. Also, why do you open and close the output file for each record? – Henry Aug 22 '18 at 15:31
  • @Henry Thanks for the insight. I have made the changes. – RohanJ Aug 22 '18 at 15:53

2 Answers


This statement is the main culprit that kills your performance:

str += line;

This causes the allocation, copying, and deallocation of a huge number of String objects: every `+=` copies the entire string accumulated so far, so the total work grows quadratically with the file size.

You need to use a StringBuilder:

StringBuilder builder = new StringBuilder();
while ( ... ) {
    builder.append(line);
}

It may also help (to a lesser extent) to read the file in larger chunks and not line by line.
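For instance, here is a sketch of chunk-wise reading into a pre-sized builder. It reuses the question's `link` variable; the byte length is only a rough capacity hint, and this still assumes the whole file fits in memory:

    // Read the file in 64 KB chunks instead of line by line; pre-sizing the
    // StringBuilder avoids repeated growth of its internal array. Unlike
    // readLine(), this keeps line terminators, which is harmless for XML.
    static String readAll(String link) throws java.io.IOException {
        java.io.File file = new java.io.File(link);
        StringBuilder builder = new StringBuilder((int) file.length());
        try (java.io.Reader reader = new java.io.FileReader(file)) {
            char[] buf = new char[64 * 1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                builder.append(buf, 0, n);
            }
        }
        return builder.toString();
    }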

Henry
  • It's still not enough; testing on a 380 MB file gave an OutOfMemoryError at `builder.append(line);` – RohanJ Aug 22 '18 at 03:49
  • The question was about being too slow, not lack of memory. Did you allow the JVM to use more memory (`-Xmx` option)? – Henry Aug 22 '18 at 03:53
  • Yes, the current settings in the ini are `-Xms512m -Xmx1536m`. – RohanJ Aug 22 '18 at 03:57
  • A 300 MB file will need 600 MB to store the character data alone (Java chars are two bytes each). You also keep the whole object in memory (presumably twice, as JSON and XML), so 1.5 GB may not be enough. – Henry Aug 22 '18 at 04:02
  • If 1.5 GB is not enough, then I need to optimize the code (or find alternatives), as the use case requires me to process files in the gigabyte range. – RohanJ Aug 22 '18 at 11:55
  • Sure, but this was not your original question. The solution to this second problem is streaming (i.e., processing the file part by part and outputting partial results as you proceed). How difficult (or even how feasible) that is depends on the structure of the XML and the JSON; it is maybe best to ask another question giving details of the respective structures. – Henry Aug 22 '18 at 14:12
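As a minimal sketch of that streaming idea, assuming StAX (javax.xml.stream, bundled with the JDK) plus the org.json converter already in use, with placeholder file names: an identity Transformer serializes one <PubmedArticle> subtree at a time, so the full document never has to sit in memory.

    import java.io.FileReader;
    import java.io.FileWriter;
    import java.io.StringWriter;
    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLStreamConstants;
    import javax.xml.stream.XMLStreamReader;
    import javax.xml.transform.OutputKeys;
    import javax.xml.transform.Transformer;
    import javax.xml.transform.TransformerFactory;
    import javax.xml.transform.stax.StAXSource;
    import javax.xml.transform.stream.StreamResult;
    import org.json.JSONObject;
    import org.json.XML;

    public class StreamingXmlToJson {
        public static void main(String[] args) throws Exception {
            XMLInputFactory xif = XMLInputFactory.newFactory();
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

            try (FileReader in = new FileReader("pubmed.xml");      // placeholder input
                 FileWriter out = new FileWriter("pubmed.json")) {  // placeholder output
                XMLStreamReader reader = xif.createXMLStreamReader(in);
                while (reader.hasNext()) {
                    // advance the cursor; stop at each <PubmedArticle> start tag
                    if (reader.next() == XMLStreamConstants.START_ELEMENT
                            && "PubmedArticle".equals(reader.getLocalName())) {
                        // serialize just this element's subtree to a string
                        StringWriter sw = new StringWriter();
                        t.transform(new StAXSource(reader), new StreamResult(sw));
                        // convert the single article and append it to the output
                        JSONObject json = XML.toJSONObject(sw.toString());
                        out.append(json.toString(2)).append('\n');
                    }
                }
                reader.close();
            }
        }
    }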

The IO operation of reading a large file is very time-consuming. Try using a library to handle this for you, for example Apache Commons IO:

File xmlFile = new File("D:\\path\\file.xml");
String xmlStr = FileUtils.readFileToString(xmlFile, "UTF-8"); // reads the whole file in one call
JSONObject xmlJson = XML.toJSONObject(xmlStr);
ninge
  • I do wonder, is `StringBuilderWriter` better than `StringBuilder`? I saw Apache Commons IO FileUtils uses it [behind the scenes](https://github.com/apache/commons-io/blob/master/src/main/java/org/apache/commons/io/output/StringBuilderWriter.java). – Bagus Tesa Aug 22 '18 at 04:18
  • From the javadoc at your link: "Writer implementation that outputs to a StringBuilder. This implementation, as an alternative to java.io.StringWriter, provides an un-synchronized (i.e. for use in a single thread) implementation for better performance. For safe usage with multiple Threads then java.io.StringWriter should be used." – ninge Aug 22 '18 at 04:40
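For illustration, a minimal sketch of `StringBuilderWriter` used as a drop-in `Writer` (assumes commons-io on the classpath; the class and the string are hypothetical):

    import java.io.Writer;
    import org.apache.commons.io.output.StringBuilderWriter;

    public class SbwDemo {
        public static void main(String[] args) throws Exception {
            // Backed by an unsynchronized StringBuilder, unlike java.io.StringWriter,
            // which wraps a synchronized StringBuffer.
            try (Writer w = new StringBuilderWriter()) {
                w.write("accumulated through the Writer API");
                System.out.println(w.toString()); // retrieve the buffered text
            }
        }
    }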