0

I have a large dump of Stack Overflow data in an XML file. I need to split the file into smaller XML files of around 500 MB each. Please provide some suggestions.

  • 4
    My suggestion is to try it first and then come back to SO with a more specific problem, not vice versa. – lukelazarovic Oct 24 '14 at 06:58
  • Possible duplicate of http://stackoverflow.com/questions/19177994/java-read-file-and-split-into-multiple-files – SMA Oct 24 '14 at 06:58
  • 2
    I hate it the way some SO people regard questions about design or technology choice as out of scope, and then are happy to criticise when someone makes a poor choice of technology. It's a perfectly good question. – Michael Kay Oct 24 '14 at 20:35
  • 1
    +1 Michael. The problem is specific enough, and I don't see why we should demand that he try to load it all in memory (won't work) or write his own streamer first (a nontrivial undertaking) before getting suggestions on a direction. –  Oct 24 '14 at 21:41

1 Answer

1

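Depending on your needs, you might be able to use the Unix split utility. It isn't aware of element boundaries, though, so it can cut in the middle of an element, and the resulting pieces won't be well-formed XML on their own.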

If you need to do this in an XML-aware fashion, here's an article describing another approach, via XML streaming. Coincidentally it describes breaking down a 30 GB XML file:

http://java.dzone.com/articles/splitting-large-xml-files-java
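
To give a rough idea of what the streaming approach looks like, here is a minimal sketch using the StAX API (javax.xml.stream) that ships with the JDK. It is not the code from the article above. It assumes the records you want to keep intact are the direct children of the root element (e.g. <row> elements under a single root); the class name XmlSplitter, the split method, and the chunk-N.xml file names are made up for illustration. Each chunk repeats the original root element so that it is well-formed on its own, and a chunk is closed once it reaches the size budget, so files can overshoot by up to one record.

```java
import javax.xml.stream.*;
import javax.xml.stream.events.*;
import java.io.*;
import java.nio.file.*;

public class XmlSplitter {

    /** Counts bytes written so we know when a chunk has reached its budget. */
    static final class CountingStream extends FilterOutputStream {
        long count;
        CountingStream(OutputStream out) { super(out); }
        @Override public void write(int b) throws IOException { out.write(b); count++; }
        @Override public void write(byte[] b, int off, int len) throws IOException {
            out.write(b, off, len);
            count += len;
        }
    }

    /**
     * Splits the document read from 'in' into chunk-1.xml, chunk-2.xml, ... inside
     * 'outDir' (which must already exist). Returns the number of chunks written.
     */
    public static int split(InputStream in, Path outDir, long maxBytes)
            throws IOException, XMLStreamException {
        XMLEventReader reader = XMLInputFactory.newFactory().createXMLEventReader(in);
        XMLOutputFactory outFactory = XMLOutputFactory.newFactory();
        XMLEventFactory events = XMLEventFactory.newFactory();

        StartElement root = null;      // remembered so every chunk can repeat it
        XMLEventWriter writer = null;  // null means "no chunk currently open"
        CountingStream counter = null;
        int depth = 0;
        int chunks = 0;

        while (reader.hasNext()) {
            XMLEvent e = reader.nextEvent();
            if (e.isStartElement()) {
                depth++;
                if (depth == 1) {          // the root element itself
                    root = e.asStartElement();
                    continue;
                }
                if (depth == 2 && writer == null) {   // first record of a new chunk
                    Path file = outDir.resolve("chunk-" + (++chunks) + ".xml");
                    counter = new CountingStream(
                            new BufferedOutputStream(Files.newOutputStream(file)));
                    writer = outFactory.createXMLEventWriter(counter, "UTF-8");
                    writer.add(events.createStartDocument("UTF-8", "1.0"));
                    writer.add(root);
                }
            }
            // Copy everything that belongs to a record (depth 2 and deeper).
            if (writer != null && depth >= 2) {
                writer.add(e);
            }
            if (e.isEndElement()) {
                depth--;
                if (depth == 1 && writer != null) {   // a record just ended
                    writer.flush();                   // simple, though not the fastest option
                    if (counter.count >= maxBytes) {
                        closeChunk(writer, counter, events, root);
                        writer = null;
                    }
                }
            }
        }
        if (writer != null) {
            closeChunk(writer, counter, events, root);
        }
        reader.close();
        return chunks;
    }

    /** Writes the closing root tag and releases the file. */
    private static void closeChunk(XMLEventWriter writer, OutputStream out,
                                   XMLEventFactory events, StartElement root)
            throws IOException, XMLStreamException {
        writer.add(events.createEndElement(root.getName().getPrefix(),
                root.getName().getNamespaceURI(), root.getName().getLocalPart()));
        writer.add(events.createEndDocument());
        writer.close();
        out.close();   // XMLEventWriter.close() does not close the underlying stream
    }
}
```

You would call it with something like XmlSplitter.split(Files.newInputStream(Paths.get("dump.xml")), Paths.get("out"), 500L * 1024 * 1024), where dump.xml and out are placeholder names. Only one record's worth of events is held at a time, so memory use should not grow with the size of the input.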

EDIT: Michael Kay notes in a comment below (I guess he can't add an answer since the question is closed) that XSLT 3.0 adds support for streaming, which lets you process huge files without holding the whole document in memory. Although XSLT 3.0 is still a draft spec at the time of writing, the commercial Saxon-EE product supports nearly all of the draft.