
I posted before about a huge XML file - it's a 287 GB XML Wikipedia dump that I want to put into a CSV file (revision authors and timestamps). I managed to do that up to a point. Before, I got a StackOverflowError, but now, after solving that first problem, I get: java.lang.OutOfMemoryError: Java heap space.

My code (partly taken from Justin Kramer's answer) looks like this:

(defn process-pages
  [page]
  (let [title     (article-title page)
        revisions (filter #(= :revision (:tag %)) (:content page))]
    (for [revision revisions]
      (let [user (revision-user revision)
            time (revision-timestamp revision)]
        (spit "files/data.csv"
              (str "\"" time "\";\"" user "\";\"" title "\"\n" )
              :append true)))))

(defn open-file
  [file-name]
  (let [rdr (BufferedReader. (FileReader. file-name))]
    (->> (:content (data.xml/parse rdr :coalescing false))
         (filter #(= :page (:tag %)))
         (map process-pages))))

I don't show the article-title, revision-user and revision-timestamp functions, because they simply take data from a specific place in the page or revision hash. Could anyone help me with this? I'm really new to Clojure and don't understand the problem.

trzewiczek

3 Answers


Just to be clear, (:content (data.xml/parse rdr :coalescing false)) IS lazy. Check its class or pull the first item (it will return instantly) if you're not convinced.
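
For example, a quick REPL check along these lines (with "dump.xml" standing in for the real dump file) shows a lazy seq whose first element comes back immediately:

(require '[clojure.data.xml :as data.xml]
         '[clojure.java.io :as io])

(with-open [rdr (io/reader "dump.xml")]
  (let [content (:content (data.xml/parse rdr :coalescing false))]
    (println (class content))           ; a lazy seq, e.g. clojure.lang.LazySeq
    (println (:tag (first content)))))  ; returns at once, without reading the whole file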

That said, there are a couple of things to watch out for when processing large sequences: holding onto the head, and unrealized/nested laziness. I think your code suffers from the latter.

Here's what I recommend:

1) Add (dorun) to the end of the ->> chain of calls. This will force the sequence to be fully realized without holding onto the head.

2) Change for in process-pages to doseq. You're spitting to a file, which is a side effect, and you don't want to do that lazily here.

As Arthur recommends, you may want to open an output file once and keep writing to it, rather than opening and writing (spit) for every Wikipedia entry; a sketch combining all three changes follows.
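
Applied to the code from the question, that might look something like this sketch (untested; it reuses article-title, revision-user and revision-timestamp from the question and writes through one open writer instead of spit):

(require '[clojure.data.xml :as data.xml]
         '[clojure.java.io :as io])

;; Sketch only: doseq makes the side effects eager, dorun realizes the page
;; sequence without holding onto its head, and a single writer stays open
;; for the whole run instead of calling spit per revision.
(defn process-pages [w page]
  (let [title     (article-title page)
        revisions (filter #(= :revision (:tag %)) (:content page))]
    (doseq [revision revisions]
      (let [user (revision-user revision)
            time (revision-timestamp revision)]
        (.write w (str "\"" time "\";\"" user "\";\"" title "\"\n"))))))

(defn open-file [file-name]
  (with-open [rdr (io/reader file-name)
              w   (io/writer "files/data.csv")]
    (->> (:content (data.xml/parse rdr :coalescing false))
         (filter #(= :page (:tag %)))
         (map #(process-pages w %))
         (dorun))))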

UPDATE:

Here's a rewrite which attempts to separate concerns more clearly:

(require '[clojure.data.xml :as data.xml]
         '[clojure.java.io :as io])

(defn filter-tag [tag xml]
  (filter #(= tag (:tag %)) xml))

;; lazy
(defn revision-seq [xml]
  (for [page (filter-tag :page (:content xml))
        :let [title (article-title page)]
        revision (filter-tag :revision (:content page))
        :let [user (revision-user revision)
              time (revision-timestamp revision)]]
    [time user title]))

;; eager
(defn transform [in out]
  (with-open [r (io/input-stream in)
              w (io/writer out)]
    ;; bind *out* to the file writer so println writes straight to the CSV
    (binding [*out* w]
      (let [xml (data.xml/parse r :coalescing false)]
        (doseq [[time user title] (revision-seq xml)]
          (println (str "\"" time "\";\"" user "\";\"" title "\"")))))))

(transform "dump.xml" "data.csv")

I don't see anything here that would cause excessive memory use.

Justin Kramer
    The point about dorun could be made a little clearer for someone new to Clojure: the open-file function as shown in the question returns the sequence of results of calls to process-pages, and when the function is called from the repl, printing the sequence causes all the results to be held in memory at the same time. Calling dorun on the result causes the elements of the sequence to be evaluated and nil to be returned, so that there is never a need to have all the results in memory at the same time. – Jouni K. Seppänen Apr 03 '12 at 14:58
  • Thanks for the explanation! I do (hopefully) understand now how the laziness works in this code snippet, and I changed what you proposed, but I still get `OutOfMemoryError: Java heap space`. I'm working on a 1 GB sample of the final file, but it still hits the memory error. I'd be really thankful for any help. – trzewiczek Apr 03 '12 at 18:06
  • See my latest update. If you still get OutOfMemory error, I'm not sure why. I used code very similar to this with no memory problems. – Justin Kramer Apr 03 '12 at 18:33
  • Ideas for troubleshooting: does it always run out of memory on the same item? Is that item out of the ordinary (e.g., really big, lots of revisions)? Have you tried giving the JVM more memory? Are you sure you're not holding onto any substrings anywhere (the JVM does not GC strings which have substrings still in use)? – Justin Kramer Apr 03 '12 at 18:46
  • Basically - THANKS A LOT for all your help. I spent some more hours with it, and it's too complicated for me in terms of JVM tweaks; the more I try to use memory options, the more fancy errors I get. I'll probably spend some more time with Clojure and the JVM before I'm able to handle this problem correctly. – trzewiczek Apr 05 '12 at 15:16
  • I have a similar issue with a large (150MB) file, but this technique did not work for me. I'm wondering if this is a depth vs. breadth issue with the data structures? For my file, there are nested levels (~6) and I need information from each. – bnbeckwith Nov 02 '12 at 20:49

Unfortunately, data.xml/parse is not lazy; it attempts to read the whole file into memory and then parse it.

Instead, use this (lazy) XML library, which holds only the part it is currently working on in RAM. You will then need to restructure your code to write the output as it reads the input, instead of gathering all the XML and then outputting it.

Your line

(:content (data.xml/parse rdr :coalescing false))

will load all the XML into memory and then request the content key from it, which will blow the heap.

A rough outline of a lazy answer would look something like this:

(with-open [input  (java.io.FileInputStream. "/tmp/foo.xml")
            output (java.io.FileWriter. "/tmp/foo.csv")]
  ;; doseq forces the writes while the streams are still open
  (doseq [element (filter is-the-tag-i-want? (parse input))]
    (write-to-file output element)))

Have patience, working with (> data ram) always takes time :)

Arthur Ulfeldt

I don't know about Clojure, but in plain Java one could use a SAX event-based parser like http://docs.oracle.com/javase/1.4.2/docs/api/org/xml/sax/XMLReader.html, which doesn't need to load the whole XML into RAM.
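
The same API is reachable from Clojure through Java interop. Here is a minimal sketch under that assumption; the count-pages name and the page-counting logic are illustrative only, not something from this answer:

(import [javax.xml.parsers SAXParserFactory]
        [org.xml.sax.helpers DefaultHandler])

;; Streams the file through a SAX DefaultHandler and counts <page> elements;
;; nothing is accumulated in memory, so heap use stays flat.
(defn count-pages [file-name]
  (let [counter (atom 0)
        handler (proxy [DefaultHandler] []
                  (startElement [uri local-name q-name attrs]
                    (when (= "page" q-name)
                      (swap! counter inc))))]
    (.parse (.newSAXParser (SAXParserFactory/newInstance))
            (java.io.File. file-name)
            handler)
    @counter))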

Niklas Schnelle