OutOfMemoryError when parsing XML in Clojure with data.zip

Question

I want to use Clojure to extract the titles from a Wiktionary XML dump.

I used head -n10000 > out-10000.xml to create smaller versions of the original monster file. Then I trimmed with a text editor to make it valid XML. I renamed the files according to the number of lines inside (wc -l):

(def data-9764 "data/wiktionary-en-9764.xml") ; 354K
(def data-99224 "data/wiktionary-en-99224.xml") ; 4.1M
(def data-995066 "data/wiktionary-en-995066.xml") ; 34M
(def data-7999931 "data/wiktionary-en-7999931.xml") ; 222M

Here is the overview of the XML structure:

<mediawiki>
  <page>
    <title>dictionary</title>
    <revision>
      <id>20100608</id>
      <parentid>20056528</parentid>
      <timestamp>2013-04-06T01:14:29Z</timestamp>
      <text xml:space="preserve">
        ...
      </text>
    </revision>
  </page>
</mediawiki>

Here is what I've tried, based on this answer to 'Clojure XML Parsing':

(ns example.core
  (:use [clojure.data.zip.xml :only (attr text xml->)])
  (:require [clojure.xml :as xml]
            [clojure.zip :as zip]))

(defn titles
  "Extract titles from +filename+"
  [filename]
  (let [xml (xml/parse filename)
        zipped (zip/xml-zip xml)]
    (xml-> zipped :page :title text)))

(count (titles data-9764))
; 38

(count (titles data-99224))
; 779

(count (titles data-995066))
; 5172

(count (titles data-7999931))
; OutOfMemoryError Java heap space  java.util.Arrays.copyOfRange (Arrays.java:3209)

Am I doing something wrong in my code? Or is this perhaps a bug or limitation in the libraries I'm using? Based on REPL experimentation, it seems like the code I'm using is lazy. Underneath, Clojure uses a SAX XML parser, so that alone should not be the problem.

Update 2013-04-30:

I'd like to share some discussion from the clojure IRC channel. I've pasted an edited version below. (I removed the user names, but if you want credit, just let me know; I'll edit and give you a link.)

The entire tag is read into memory at once in xml/parse, long before you even call count. And clojure.xml uses the ~lazy SAX parser to produce an eager concrete collection. Processing XML lazily requires a lot more work than you think - and it would be work you do, not some magic clojure.xml could do for you. Feel free to disprove by calling (count (xml/parse data-whatever)).

To summarize: even before zip/xml-zip is involved, xml/parse alone causes an OutOfMemoryError with a large enough file:

(count (xml/parse filename))

At present, I am exploring other XML processing options. At the top of my list is clojure.data.xml as mentioned at https://stackoverflow.com/a/9946054/109618.

Ahh, yeah. Should have spotted that earlier. You definitely want `clojure.data.xml` and not `clojure.xml` - the transition should be pretty easy. — Alex, Apr 30 '13 at 15:57

score 4 · Accepted Answer · edited Apr 07 '18 at 04:52

It's a limitation of the zipper data structure. Zippers are designed for efficiently navigating trees of various sorts, with support for moving up/down/left/right in the tree hierarchy, with in-place edits in near-constant time.

From any position in the tree, the zipper needs to be able to re-construct the original tree (with edits applied). To do that, it keeps track of the current node, the parent node, and all siblings to the left and right of the current node in the tree, making heavy use of persistent data structures.

The filter functions that you're using start at the left-most child of a node and work their way one-by-one to the right, testing predicates along the way. The zipper for the left-most child starts out with an empty vector for its left-hand siblings (note the :l [] part in the source for zip/down). Each time you move right, it will add the last node visited to the vector of left-hand siblings (:l (conj l node) in zip/right). By the time you arrive at the right-most child, you've built up an in-memory vector of all the nodes in that level in the tree, which, for a wide tree like yours, could cause an OOM error.
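
The accumulation can be observed with clojure.zip alone. This is a minimal sketch, assuming a small hand-built tree in the shape that xml/parse returns; the names root and last-page are made up for illustration:

```clojure
(require '[clojure.zip :as zip])

;; A tiny stand-in for the parsed dump: one root with five <page> children.
(def root
  {:tag :mediawiki
   :attrs nil
   :content (vec (for [i (range 5)]
                   {:tag :page :attrs nil :content [(str "page-" i)]}))})

;; Walk from the left-most child to the right-most one, as the filter
;; functions do. zip/right returns nil once there is no right sibling.
(def last-page
  (->> (zip/down (zip/xml-zip root))
       (iterate zip/right)
       (take-while some?)
       last))

;; The location of the last child carries every sibling it passed on the way.
(count (zip/lefts last-page)) ; => 4
```

With millions of <page> elements in place of five, that left-sibling vector is what grows.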

As a workaround, if you know that the top-level element is just a container for a list of <page> elements, I'd suggest using the zipper to navigate within the page elements and just use map to process the pages:

(defn titles
  "Extract titles from +filename+"
  [filename]
  (let [xml (xml/parse filename)]
    (map #(xml-> (zip/xml-zip %) :title text)
         (:content xml))))

So, basically, we avoid the zipper abstraction at the top level of the XML input, and so avoid having it hold the entire document in memory. This implies that for even larger XML, where each first-level child is itself huge, we may have to skip the zipper again at the second level of the XML structure, and so on.

This approach works, but overall this makes clojure code processing large XML files somewhat obtuse to read in hindsight: we require functions from 3 namespaces (`data.xml`, `zip`, `data.zip.xml`) and need to skip the actual zipping abstraction at the top level, to avoid holding the entire xml in memory. I wonder if there's any other library providing a smoother way. — matanster, Apr 07 '18 at 04:49

Francis Avila · Answer 2 · 2013-04-30T15:27:22.757

Looking at the source for xml-zip, it doesn't seem like it is entirely lazy:

(defn xml-zip
  "Returns a zipper for xml elements (as from xml/parse),
  given a root element"
  {:added "1.0"}
  [root]
    (zipper (complement string?) 
            (comp seq :content)
            (fn [node children]
              (assoc node :content (and children (apply vector children))))
            root))

Note (apply vector children), which is materializing the children seq to a vector (although it is not materializing the entire descendant tree, so it's still lazy). If you have a very large number of children for a node (e.g., children of <mediawiki>), then even this level of laziness is not enough--:content needs to be a seq too.

My knowledge of zippers is extremely limited, so I'm not sure why vector is being used here at all; see if replacing (assoc node :content (and children (apply vector children))) with (assoc node :content children) works, which should keep children as a normal sequence without materializing it.
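
A sketch of that substitution, built on zip/zipper directly. The helper name seq-xml-zip and the sample tree are hypothetical, and whether this changes memory usage for a read-only traversal is untested here, since the third argument (make-node) only comes into play when the tree is edited:

```clojure
(require '[clojure.zip :as zip])

(defn seq-xml-zip
  "Like clojure.zip/xml-zip, except that make-node stores the children
  seq as-is instead of copying it into a vector. Hypothetical helper."
  [root]
  (zip/zipper (complement string?)      ; branch?: anything that is not text
              (comp seq :content)       ; children: same as xml-zip
              (fn [node children]       ; make-node: no (apply vector ...)
                (assoc node :content children))
              root))

(def tree
  {:tag :mediawiki :attrs nil
   :content [{:tag :page :attrs nil
              :content [{:tag :title :attrs nil :content ["dictionary"]}]}]})

;; Navigation works the same way as with xml-zip.
(-> (seq-xml-zip tree) zip/down zip/down zip/node :content first)
;; => "dictionary"
```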

(For that matter, I'm not sure why (apply vector children) instead of (vec children)...)

content-handler also looks like it builds up all content elements, in *current*, so the source of the OOM may be the content-handler itself.

I'm not sure how we can reconcile the zipper interface (tree-like) with the streaming you want. It will work for large xml, but not huge xml.

In similar approaches in other languages (e.g. Python's iterparse), a tree is built up incrementally, as with the zipper. The difference is that the tree is pruned after each element has been processed.

For example, in Python with iterparse you would listen for an "end" event on page (i.e. when </page> occurs in the XML; the SAX equivalent is endElement). At this point you know you have a complete page element which you can process as a tree. After you are finished, you delete the element you just processed and the sibling branches, which keeps memory usage under control.

Perhaps you can take this approach here as well. The node provided by the xml zipper is a var to an xml/element. The content handler could return a function that does cleanup on its *current* var when invoked. Then you can call it to prune the tree.

Alternatively, you could use SAX "by hand" in clojure for the root element, and create a zipper for each page element as you encounter it.
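
A rough sketch of the by-hand route, with two substitutions stated plainly: it uses the JDK's StAX pull parser (javax.xml.stream) instead of SAX callbacks, and it reads each title's text directly instead of building a zipper per page. The names stream-titles, titles-from-string and sample are made up for illustration, and it assumes <title> elements contain only text:

```clojure
(import '(javax.xml.stream XMLInputFactory XMLStreamConstants XMLStreamReader))

(defn stream-titles
  "Calls f with the text of each <title> element read from rdr.
  No tree is built; only the current title string is handed to f."
  [^java.io.Reader rdr f]
  (let [^XMLStreamReader xr (.createXMLStreamReader (XMLInputFactory/newInstance) rdr)]
    (try
      (while (.hasNext xr)
        (when (and (= XMLStreamConstants/START_ELEMENT (.next xr))
                   (= "title" (.getLocalName xr)))
          ;; getElementText consumes up to the matching </title>
          (f (.getElementText xr))))
      (finally (.close xr)))))

(def sample
  (str "<mediawiki>"
       "<page><title>dictionary</title><revision><id>1</id></revision></page>"
       "<page><title>free</title></page>"
       "</mediawiki>"))

(defn titles-from-string
  "Collects the titles into a vector; fine for small inputs only."
  [s]
  (let [acc (volatile! [])]
    (stream-titles (java.io.StringReader. s) #(vswap! acc conj %))
    @acc))

(titles-from-string sample) ; => ["dictionary" "free"]
```

For the real dump, the same function could be fed a reader from clojure.java.io/reader inside with-open, with f incrementing a counter rather than collecting the strings.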

Not sure whether the vector is strictly necessary there, but I don't think it's the cause of the OOM error. The vector is used in the make-node function, which is only called when the zipper is edited in some way. That doesn't appear to be the case here. — Alex, Apr 30 '13 at 14:42
