I want to use Clojure to extract the titles from a Wiktionary XML dump.
I used head -n10000 > out-10000.xml to create smaller versions of the original monster file. Then I trimmed each one with a text editor to make it valid XML. I renamed the files according to the number of lines inside (wc -l):
(def data-9764 "data/wiktionary-en-9764.xml") ; 354K
(def data-99224 "data/wiktionary-en-99224.xml") ; 4.1M
(def data-995066 "data/wiktionary-en-995066.xml") ; 34M
(def data-7999931 "data/wiktionary-en-7999931.xml") ; 222M
Here is the overview of the XML structure:
<mediawiki>
  <page>
    <title>dictionary</title>
    <revision>
      <id>20100608</id>
      <parentid>20056528</parentid>
      <timestamp>2013-04-06T01:14:29Z</timestamp>
      <text xml:space="preserve">
        ...
      </text>
    </revision>
  </page>
</mediawiki>
Here is what I've tried, based on this answer to 'Clojure XML Parsing':
(ns example.core
  (:use [clojure.data.zip.xml :only (attr text xml->)])
  (:require [clojure.xml :as xml]
            [clojure.zip :as zip]))

(defn titles
  "Extract titles from +filename+"
  [filename]
  (let [xml (xml/parse filename)
        zipped (zip/xml-zip xml)]
    (xml-> zipped :page :title text)))
(count (titles data-9764))
; 38
(count (titles data-99224))
; 779
(count (titles data-995066))
; 5172
(count (titles data-7999931))
; OutOfMemoryError Java heap space java.util.Arrays.copyOfRange (Arrays.java:3209)
Am I doing something wrong in my code? Or is this perhaps a bug or limitation in the libraries I'm using? Based on REPL experimentation, it seems like the code I'm using is lazy. Underneath, Clojure uses a SAX XML parser, so that alone should not be the problem.
Update 2013-04-30:
I'd like to share some discussion from the Clojure IRC channel. I've pasted an edited version below. (I removed the user names, but if you want credit, just let me know; I'll edit and give you a link.)
The entire tag is read into memory at once in xml/parse, long before you even call count. And clojure.xml uses the ~lazy SAX parser to produce an eager concrete collection. Processing XML lazily requires a lot more work than you think - and it would be work you do, not some magic clojure.xml could do for you. Feel free to disprove by calling (count (xml/parse data-whatever)).
To summarize, even before using zip/xml-zip, this xml/parse call causes an OutOfMemoryError with a large enough file:
(count (xml/parse filename))
At present, I am exploring other XML processing options. At the top of my list is clojure.data.xml as mentioned at https://stackoverflow.com/a/9946054/109618.
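One dependency-free direction is a pull parser. Below is a minimal sketch that reads titles with the JDK's StAX API (javax.xml.stream) through Java interop. Note that this is a different technique from clojure.data.xml, swapped in only because it needs no extra library. The namespace example.stream and the function reduce-titles are hypothetical names chosen for illustration, and the sketch assumes every <title> element contains only text. Because it folds over parser events instead of building a tree, it should hold only the current event in memory rather than the whole document.

```clojure
(ns example.stream
  (:require [clojure.java.io :as io])
  (:import (javax.xml.stream XMLInputFactory XMLStreamConstants XMLStreamReader)))

(defn reduce-titles
  "Hypothetical helper: calls (f acc title) for each <title> element in
  `source` (a file name or InputStream) and returns the final accumulator.
  Assumes <title> elements contain text only."
  [f init source]
  (with-open [^java.io.InputStream in (io/input-stream source)]
    (let [factory (XMLInputFactory/newInstance)
          ^XMLStreamReader rdr (.createXMLStreamReader factory in)]
      (loop [acc init]
        (if (.hasNext rdr)
          (let [event (.next rdr)]
            (if (and (= event XMLStreamConstants/START_ELEMENT)
                     (= "title" (.getLocalName rdr)))
              ;; getElementText consumes events up to the matching end tag
              (recur (f acc (.getElementText rdr)))
              (recur acc)))
          acc)))))
```

Counting titles would then look like (reduce-titles (fn [n _] (inc n)) 0 data-7999931), and collecting them like (reduce-titles conj [] data-9764). I chose a reduce-style function over a lazy sequence so the stream is closed deterministically by with-open.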