I'm new to Clojure, and my first project has to deal with a huge (250+ GB) XML file. I want to load it into PostgreSQL to process later, but have no idea how to approach such a big file.
-
Start by understanding how to approach a small file, then work up. – Mongus Pong Mar 30 '12 at 08:57
-
What does this XML look like? Highly arborescent, or a flat collection of numerous items? – cgrand Mar 30 '12 at 09:26
4 Answers
I used the new clojure.data.xml
to process a 31GB Wikipedia dump on a modest laptop. The old lazy-xml
contrib library did not work for me (it ran out of memory).
https://github.com/clojure/data.xml
Simplified example code:
(require '[clojure.data.xml :as data.xml])

(defn process-page [page]
  ;; ...
  )

(defn page-seq [rdr]
  ;; lazily walk the top-level children, keeping only <page> elements
  (->> (:content (data.xml/parse rdr))
       (filter #(= :page (:tag %)))
       (map process-page)))

-
So is this what @ivant is referring to? The Clojure IO implementation for lazy-xml is broken somehow? – andrew cooke Mar 30 '12 at 16:02
-
Yes, it has issues. Regardless, it's part of old clojure contrib and is deprecated. `data.xml` is the replacement. – Justin Kramer Mar 30 '12 at 16:26
-
OK - I spent a few hours trying all the possible combinations of ((())) but with no success. I get a StackOverflowError, and, as I understand it, it's because I use this: `(with-open [rdr (BufferedReader. (FileReader. file-name))]` and should use some input stream instead, but I'm new to Clojure and after those few hours... Could you help? – trzewiczek Mar 30 '12 at 19:22
-
Sorry, I forgot to mention: I encountered that StackOverflowError, and it's due to an upstream bug in Java. The workaround is to use `(data.xml/parse rdr :coalescing false)`, which only works on the master branch of `data.xml` for now. The 0.0.3 release doesn't have the `:coalescing` option. – Justin Kramer Mar 30 '12 at 20:39
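For reference, the workaround above would be applied like this (a minimal sketch; the file name is a placeholder, and the `:coalescing` option assumes a `data.xml` build newer than 0.0.3):

```clojure
(require '[clojure.data.xml :as data.xml])

;; Placeholder file name; any Reader/InputStream over the dump works.
(with-open [in (java.io.FileInputStream. "plwiki-dump.xml")]
  ;; :coalescing false sidesteps the upstream Java bug that causes
  ;; the StackOverflowError on very large text nodes.
  (count (filter #(= :page (:tag %))
                 (:content (data.xml/parse in :coalescing false)))))
```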
-
I ran some overnight tests, but in the morning I saw that it crashes after 32k wiki revisions (yes, it's a full Polish Wikipedia dump) with an error: `java.lang.OutOfMemoryError: Java heap space`. I tried *everything*, but still can't process the data. – trzewiczek Mar 31 '12 at 05:29
-
Hard to troubleshoot that without seeing code. Might be worth posting it as a new question. – Justin Kramer Mar 31 '12 at 20:03
-
Got this working with Wikipedia dumps - nice. You'll want to wrap it in `with-open` and `dorun` to get it to lazily work through a large list: (with-open [rdr (fn-that-opens-a-reader)] (dorun (->> ...as above... – Korny Jul 07 '13 at 12:55
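Combining Korny's comment with the answer's code, the complete pattern might look like the following sketch (the file name and the `process-page` body are placeholder assumptions):

```clojure
(require '[clojure.data.xml :as data.xml]
         '[clojure.java.io :as io])

(defn process-page [page]
  ;; Placeholder: extract whatever you need from one <page> element.
  (:tag page))

;; with-open closes the reader when the traversal finishes; dorun
;; forces the lazy seq for side effects without retaining its head.
(with-open [rdr (io/reader "wiki-dump.xml")]
  (dorun (->> (:content (data.xml/parse rdr))
              (filter #(= :page (:tag %)))
              (map process-page))))
```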
Processing huge XML is usually done with SAX; in the case of Clojure, this is http://richhickey.github.com/clojure-contrib/lazy-xml-api.html
See `(parse-seq File/InputStream/URI)`.
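A sketch of what event-based processing with lazy-xml might look like (the event-map keys `:type` and `:name` are taken from the linked API docs; treat the details as assumptions, since the library is part of the deprecated old contrib):

```clojure
(require '[clojure.contrib.lazy-xml :as lxml])

;; parse-seq emits a lazy seq of event maps such as
;; {:type :start-element, :name :page, :attrs {...}},
;; so elements can be counted without building a tree in memory.
(with-open [in (java.io.FileInputStream. "dump.xml")]
  (count (filter #(and (= :start-element (:type %))
                       (= :page (:name %)))
                 (lxml/parse-seq in))))
```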

-
The API may be lazy, but IO isn't, so I doubt it would work on a file of that size. – ivant Mar 30 '12 at 12:25
-
@ivant you connect it to an input stream that reads data incrementally. It's standard practice for processing large XML files in Java. – andrew cooke Mar 30 '12 at 13:04
-
See Justin's answer for an explanation of what ivant may be referring to here. – andrew cooke Mar 30 '12 at 16:33
You can also use the expresso XML parser for massive files (www.expressoxml.com). It can parse files of 36 GB and more, as it is not limited by file size. It can return up to 230,000 elements from a search, and it is available via streaming over the "cloud" from their website. And best of all, their developer version is free.

-
Even though you haven't tried to disguise this advert as impartial advice, it's best to explicitly state your strong affiliation with that product. https://twitter.com/Lughnasagh/status/260387856772653056. – Rob Grant Apr 24 '14 at 14:25
If the XML is a collection of records, https://github.com/marktriggs/xml-picker-seq is what you need to process records regardless of the XML's size. It uses XOM under the hood and processes one 'record' at a time.
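A hypothetical usage sketch: the namespace, argument order, and XPath callback below are assumptions inferred from the project's README and the comments here, so check the repository for the actual API:

```clojure
(require '[xml-picker-seq.core :as xps]
         '[clojure.java.io :as io])

;; Assumed API: pick each <record> element in turn and run XPath
;; queries against it via the query function passed to the callback.
(with-open [rdr (io/reader "records.xml")]
  (dorun (xps/xml-picker-seq rdr "record"
                             (fn [xpath-query]
                               (println (xpath-query "title/text()"))))))
```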

-
I tried that too, but with no success. I mean - it did the trick with the huge file, but I can't get the results with xpath-query - empty results come out of it. The only xpath query that works is ".", but that's not what I wanted... Couldn't manage this problem after more than two hours... :( – trzewiczek Mar 30 '12 at 20:39