I'm new to Clojure, and my first project has to deal with a huge (250+ GB) XML file. I want to load it into PostgreSQL to process later, but have no idea how to approach such a big file.
-
Start by understanding how to approach a small file, then work up. – Mongus Pong Mar 30 '12 at 08:57
-
What does this XML look like? Highly arborescent, or a flat collection of numerous items? – cgrand Mar 30 '12 at 09:26
4 Answers
I used the new clojure.data.xml
to process a 31GB Wikipedia dump on a modest laptop. The old lazy-xml
contrib library did not work for me (it ran out of memory).
https://github.com/clojure/data.xml
Simplified example code:
(require '[clojure.data.xml :as data.xml])

(defn process-page [page]
  ;; ...
  )

(defn page-seq [rdr]
  ;; lazily walk the top-level children, keeping only <page> elements
  (->> (:content (data.xml/parse rdr))
       (filter #(= :page (:tag %)))
       (map process-page)))

-
So is this what @ivant is referring to? The Clojure IO implementation for lazy-xml is broken somehow? – andrew cooke Mar 30 '12 at 16:02
-
Yes, it has issues. Regardless, it's part of old clojure contrib and is deprecated. `data.xml` is the replacement. – Justin Kramer Mar 30 '12 at 16:26
-
OK - I spent a few hours trying all the possible combinations of ((())) but with no success. I get a StackOverflowError, and, as I understand it, it's because I use this: `(with-open [rdr (BufferedReader. (FileReader. file-name))]` and should use some input stream instead, but I'm new to Clojure and after those few hours... Could you help? – trzewiczek Mar 30 '12 at 19:22
-
Sorry, I forgot to mention: I encountered that StackOverflowError, and it's due to an upstream bug in Java. The workaround is to use `(data.xml/parse rdr :coalescing false)`, which only works on the master branch of `data.xml` for now. The 0.0.3 release doesn't have the `:coalescing` option. – Justin Kramer Mar 30 '12 at 20:39
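For reference, the workaround above would be applied like this (a minimal sketch; the file name is a placeholder, and the `:coalescing` option assumes a `data.xml` build newer than 0.0.3):

```clojure
(require '[clojure.data.xml :as data.xml])

;; Placeholder file name; any Reader/InputStream over the dump works.
(with-open [in (java.io.FileInputStream. "plwiki-dump.xml")]
  ;; :coalescing false sidesteps the upstream Java bug that causes
  ;; the StackOverflowError on very large text nodes.
  (count (filter #(= :page (:tag %))
                 (:content (data.xml/parse in :coalescing false)))))
```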
-
I ran some overnight tests, but in the morning I saw that it crashes after 32k wiki revisions (yes, it's a full Polish Wikipedia dump) with an error: `java.lang.OutOfMemoryError: Java heap space`. I tried *everything*, but still can't process the data. – trzewiczek Mar 31 '12 at 05:29
-
Hard to troubleshoot that without seeing code. Might be worth posting it as a new question. – Justin Kramer Mar 31 '12 at 20:03
-
Got this working with Wikipedia dumps - nice. You'll want to wrap it in `with-open` and `dorun` to get it to lazily work through a large list: (with-open [rdr (fn-that-opens-a-reader)] (dorun (->> ...as above... – Korny Jul 07 '13 at 12:55
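Combining Korny's comment with the answer's code, the complete pattern might look like the following sketch (the file name and the `process-page` body are placeholder assumptions):

```clojure
(require '[clojure.data.xml :as data.xml]
         '[clojure.java.io :as io])

(defn process-page [page]
  ;; Placeholder: extract whatever you need from one <page> element.
  (:tag page))

;; with-open closes the reader when the traversal finishes; dorun
;; forces the lazy seq for side effects without retaining its head.
(with-open [rdr (io/reader "wiki-dump.xml")]
  (dorun (->> (:content (data.xml/parse rdr))
              (filter #(= :page (:tag %)))
              (map process-page))))
```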
Processing huge XML is usually done with SAX; in the case of Clojure, this is http://richhickey.github.com/clojure-contrib/lazy-xml-api.html
See `(parse-seq File/InputStream/URI)`.
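A sketch of what event-based processing with lazy-xml might look like (the event-map keys `:type` and `:name` are taken from the linked API docs; treat the details as assumptions, since the library is part of the deprecated old contrib):

```clojure
(require '[clojure.contrib.lazy-xml :as lxml])

;; parse-seq emits a lazy seq of event maps such as
;; {:type :start-element, :name :page, :attrs {...}},
;; so elements can be counted without building a tree in memory.
(with-open [in (java.io.FileInputStream. "dump.xml")]
  (count (filter #(and (= :start-element (:type %))
                       (= :page (:name %)))
                 (lxml/parse-seq in))))
```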

-
The API may be lazy, but IO isn't, so I doubt it would work on a file of that size. – ivant Mar 30 '12 at 12:25
-
@ivant you connect it to an input stream that reads data incrementally. It's standard practice for processing large XML files in Java. – andrew cooke Mar 30 '12 at 13:04
-
See Justin's answer for an explanation of what ivant may be referring to here. – andrew cooke Mar 30 '12 at 16:33
You can also use the expresso XML parser for massive files (www.expressoxml.com). It can parse files of 36 GB and more, as it is not limited by file size. It can return up to 230,000 elements from a search, and it is available via streaming over the "cloud" from their website. And best of all, their developer version is free.

-
Even though you haven't tried to disguise this advert as impartial advice, it's best to explicitly state your strong affiliation with that product. https://twitter.com/Lughnasagh/status/260387856772653056. – Rob Grant Apr 24 '14 at 14:25
If the XML is a collection of records, https://github.com/marktriggs/xml-picker-seq is what you need to process records regardless of the XML's size. It uses XOM under the hood and processes one 'record' at a time.
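A hypothetical usage sketch: the namespace, argument order, and XPath callback below are assumptions inferred from the project's README and the comments here, so check the repository for the actual API:

```clojure
(require '[xml-picker-seq.core :as xps]
         '[clojure.java.io :as io])

;; Assumed API: pick each <record> element in turn and run XPath
;; queries against it via the query function passed to the callback.
(with-open [rdr (io/reader "records.xml")]
  (dorun (xps/xml-picker-seq rdr "record"
                             (fn [xpath-query]
                               (println (xpath-query "title/text()"))))))
```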

-
I tried that too, but with no success. I mean - it did the trick with the huge file, but I can't get the results with xpath-query - empty results come out of it. The only xpath query that works is ".", but that's not what I wanted... Couldn't manage this problem after more than two hours... :( – trzewiczek Mar 30 '12 at 20:39