
I am working on a standalone Java application (Spring Boot) that parses and processes several big XML files, around 3–4 GB each, to generate one file that combines the data from the 3 inputs (the 1st file holds the product specs, the 2nd the product details, the 3rd some other product information). So to get the full information for one node I have to read all the files.

My issue is that our clients' machines don't have much RAM. I tried eXist-db (just loading the file and writing it out); it is fairly fast, but the RAM usage is still too high: for a 1.5 GB XML file it consumes 1.6–1.7 GB. Is there any solution that can lower the RAM usage?

Thanks in advance

  • Possible duplicate of [How to Parse Big (50 GB) XML Files in Java](http://stackoverflow.com/questions/26310595/how-to-parse-big-50-gb-xml-files-in-java) – DimaSan Dec 19 '16 at 11:13
  • Do you already use a streaming parser? See also: http://stackoverflow.com/questions/3969713/java-xml-parser-for-huge-files The bad thing: you may stream each file several times, but that way your max. memory consumption stays lower than when reading everything and keeping it in memory. – Roland Dec 19 '16 at 12:50
  • Thanks for the answer. Yes, I am using a streaming parser, and the solution you propose would work, but the processing time would explode, so I can't really use this solution. – Mohamed BAHRIA Dec 19 '16 at 12:53
  • TL;DR: if it's big, split it. I once wrote a component that roughly had 3 steps: 1) split the input files (one split file per 1000 "products"), 2) translate the splits in parallel (many, small enough splits), 3) merge the translations (if your case is simple, a simple `cat` should do it). Even with a sorting step, on-the-fly GZIPing (of the intermediate files) and out-of-order merging at step 3, it handled 50 GB inputs in <128 MB of RAM (the parallel XSLT at step 2 being the bottleneck in my case), with tens of MB/sec of throughput per input file times CPU core. – GPI Dec 19 '16 at 14:57
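
For reference, a minimal StAX sketch of the streaming approach discussed in the comments above. The element name `product` and the `id` attribute are assumptions for illustration; the real node names are not given in the question.

```java
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;

public class StreamingProductReader {

    public static void main(String[] args) throws Exception {
        // Stream the file element by element; only the current node is held in memory.
        try (InputStream in = new BufferedInputStream(new FileInputStream("specs.xml"))) {
            XMLInputFactory factory = XMLInputFactory.newFactory();
            XMLStreamReader reader = factory.createXMLStreamReader(in);
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT
                            && "product".equals(reader.getLocalName())) {
                        String id = reader.getAttributeValue(null, "id");
                        handleProduct(id, reader);
                    }
                }
            } finally {
                reader.close();
            }
        }
    }

    private static void handleProduct(String id, XMLStreamReader reader) throws XMLStreamException {
        // Read the content of one <product> element here (e.g. collect it into a small DTO
        // or copy it to a per-node output), then return so it can be garbage collected.
    }
}
```

Because only the current element is kept around, the heap stays roughly flat no matter how large the input file is.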

1 Answer


So the best solution was to split the nodes: for each node I generate a file whose filename is the node's id, and all of these files are zipped together. The next time I want to access a node it is very fast, because the zip is indexed (a minimal sketch follows below).

output.zip :
--> id_nodes1
--> id_nodes2
--> id_nodes3
--> id_nodes4
--> ....

Thanks all for your answers.
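
A minimal sketch of this zip-per-node idea, assuming each node's XML fragment is already available as a string; the class and method names (`NodeZipStore`, `writeNodes`, `readNode`) are illustrative, not from the original application. Writing streams every entry out sequentially, and reading a single node only touches the zip's central directory plus that one entry:

```java
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

public class NodeZipStore {

    // Write each node's XML fragment as its own entry, named by the node id.
    public static void writeNodes(File zip, Iterable<String[]> idAndXml) throws Exception {
        try (ZipOutputStream out = new ZipOutputStream(new FileOutputStream(zip))) {
            for (String[] pair : idAndXml) {           // pair[0] = node id, pair[1] = node XML
                out.putNextEntry(new ZipEntry(pair[0]));
                out.write(pair[1].getBytes(StandardCharsets.UTF_8));
                out.closeEntry();
            }
        }
    }

    // Random access by id: the zip central directory acts as the index,
    // so only the requested entry is decompressed into memory.
    public static String readNode(File zip, String id) throws Exception {
        try (ZipFile zf = new ZipFile(zip)) {
            ZipEntry entry = zf.getEntry(id);
            if (entry == null) {
                return null;                           // unknown node id
            }
            try (InputStream in = zf.getInputStream(entry);
                 ByteArrayOutputStream buf = new ByteArrayOutputStream()) {
                byte[] chunk = new byte[8192];
                int n;
                while ((n = in.read(chunk)) != -1) {
                    buf.write(chunk, 0, n);
                }
                return new String(buf.toByteArray(), StandardCharsets.UTF_8);
            }
        }
    }
}
```

For example, `readNode(new File("output.zip"), "id_nodes2")` would return just that one node without loading the rest of the archive.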

  • Not knowing the exact number of items, one can only guess, but when I benched this on hundreds of thousands of nodes (and thus, files), the file system calls were a real bottleneck (plus it created awful sysadmin problems, like the impossibility of doing `ls *` because the list was so big...). It was *way, way* faster and more "system friendly" to group items by 1000 or so than to create a file per item. – GPI Dec 22 '16 at 14:22
  • If the nodes were ordered in each file, there would be no need to have more than 1 node per file in memory. – Tony BenBrahim Dec 23 '16 at 02:40
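
Following up on GPI's point about grouping, a tiny sketch under the assumption that roughly 1000 nodes per entry is acceptable: derive the entry (or file) name from the node id so many nodes share one entry instead of one entry per node. The bucket count and naming scheme here are hypothetical.

```java
public class NodeBuckets {

    // Map a node id to one of a fixed number of buckets, so many nodes
    // end up sharing a single zip entry / file instead of one file each.
    public static String bucketNameFor(String nodeId, int bucketCount) {
        int bucket = Math.floorMod(nodeId.hashCode(), bucketCount);
        return String.format("bucket_%05d", bucket);
    }

    public static void main(String[] args) {
        // e.g. ~1,000,000 nodes spread over 1000 buckets => ~1000 nodes per bucket
        System.out.println(bucketNameFor("id_nodes1", 1000));
    }
}
```

When reading, you then scan only the one bucket entry that can contain the requested node, which keeps both the memory footprint and the number of zip entries (or files) small.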