I do not like the way zippers work in Clojure, and I've not looked at clojure.zip/xml-zip or clojure.data.zip/xml-> (confusing that they are two separate libs!).
Instead, may I suggest you try out the tupelo.forest library? Here is an overview from the 2017 Clojure/Conj.
Below is a live solution using tupelo.forest. I added a second sentence to make it more interesting:
;; Assumed namespace setup (not shown in the original answer):
;;   (:use tupelo.core tupelo.forest tupelo.test)
;;   (:require [clojure.string :as str] [tupelo.string :as ts])
(dotest
  (with-forest (new-forest)
    (let [xml-str        (ts/quotes->double
                           "<document>
                              <sentence id='1'>
                                <word id='1.1'>foo</word>
                                <word id='1.2'>bar</word>
                              </sentence>
                              <sentence id='2'>
                                <word id='2.1'>beyond</word>
                                <word id='2.2'>all</word>
                                <word id='2.3'>recognition</word>
                              </sentence>
                            </document>")
          root-hid       (add-tree-xml xml-str)
          ;; `>>` is a throwaway binding name; the call is made only for its side effect
          >>             (remove-whitespace-leaves)
          bush-no-blanks (hid->bush root-hid)
          sentence-hids  (find-hids root-hid [:document :sentence])
          ;; one joined string per sentence node
          sentences      (forv [sentence-hid sentence-hids]
                           (let [word-hids     (hid->kids sentence-hid)
                                 words         (mapv #(grab :value (hid->leaf %)) word-hids)
                                 sentence-text (str/join \space words)]
                             sentence-text))]
      (is= bush-no-blanks
        [{:tag :document}
         [{:id "1", :tag :sentence}
          [{:id "1.1", :tag :word, :value "foo"}]
          [{:id "1.2", :tag :word, :value "bar"}]]
         [{:id "2", :tag :sentence}
          [{:id "2.1", :tag :word, :value "beyond"}]
          [{:id "2.2", :tag :word, :value "all"}]
          [{:id "2.3", :tag :word, :value "recognition"}]]])
      (is= sentences
        ["foo bar"
         "beyond all recognition"]))))
The idea is to find the hid (Hex ID, like a pointer) for each sentence. In the forv loop, we find the child nodes for each sentence, extract the :value from each, and join the values into a string. The unit tests show the tree structure as parsed from XML (after deleting blank nodes) and the final result. Note that we ignore the id fields and use only the tree structure to understand the sentences.
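To make the per-sentence step concrete, here is a minimal plain-Clojure sketch that performs the same extraction directly on the bush data shown in the unit test. It does not use the tupelo.forest API at all; the names `bush` and `bush-sentences` are hypothetical and exist only for this illustration.

```clojure
(require '[clojure.string :as str])

;; The bush from the unit test above: each node is a vector whose first
;; element is the attribute map and whose remaining elements are children.
(def bush
  [{:tag :document}
   [{:id "1", :tag :sentence}
    [{:id "1.1", :tag :word, :value "foo"}]
    [{:id "1.2", :tag :word, :value "bar"}]]
   [{:id "2", :tag :sentence}
    [{:id "2.1", :tag :word, :value "beyond"}]
    [{:id "2.2", :tag :word, :value "all"}]
    [{:id "2.3", :tag :word, :value "recognition"}]]])

(defn bush-sentences
  "Hypothetical helper: returns one space-joined string per sentence node,
  using only the nesting of the bush (the :id values are ignored)."
  [[_root-attrs & sentence-nodes]]
  (mapv (fn [[_sentence-attrs & word-nodes]]
          ;; each word node is a one-element vector holding its attribute map
          (str/join \space (map (comp :value first) word-nodes)))
        sentence-nodes))

(bush-sentences bush)
;; => ["foo bar" "beyond all recognition"]
```

As in the forest version, only the position of each node in the tree matters, not its id.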
Documentation for tupelo.forest is still a work in progress, but you can see many live examples here.
The Tupelo project lives on GitHub.
Update
I have been thinking about the streaming data problem, and have added a new function proc-tree-enlive-lazy to enable lazy processing of large data sets. Here is an example:
;; Assumes tupelo.core, tupelo.forest, and tupelo.test are referred, clojure.string is
;; aliased as str, tupelo.string as ts, and clojure.data.xml has been loaded.
(let [xml-str              (ts/quotes->double
                             "<document>
                                <sentence id='1'>
                                  <word id='1.1'>foo</word>
                                  <word id='1.2'>bar</word>
                                </sentence>
                                <sentence id='2'>
                                  <word id='2.1'>beyond</word>
                                  <word id='2.2'>all</word>
                                  <word id='2.3'>recognition</word>
                                </sentence>
                              </document>")
      enlive-tree-lazy     (clojure.data.xml/parse (java.io.StringReader. xml-str))
      ;; called once per matching subtree; root-hid points at that subtree's root
      doc-sentence-handler (fn [root-hid]
                             (remove-whitespace-leaves)
                             (let [sentence-hid  (only (find-hids root-hid [:document :sentence]))
                                   word-hids     (hid->kids sentence-hid)
                                   words         (mapv #(grab :value (hid->leaf %)) word-hids)
                                   sentence-text (str/join \space words)]
                               sentence-text))
      result-sentences     (proc-tree-enlive-lazy enlive-tree-lazy
                             [:document :sentence] doc-sentence-handler)]
  (is= result-sentences ["foo bar" "beyond all recognition"]))
The idea is that you process successive subtrees, in this case whenever you get a subtree path of [:document :sentence]. You pass in a handler function, which will receive the root-hid of a tupelo.forest tree. The return value of each handler call is then placed onto a lazy output sequence that is returned to the caller.
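The subtree-plus-handler pattern can be sketched in plain Clojure as follows. This is only an illustration of the idea, not how proc-tree-enlive-lazy is implemented: it works on an enlive-style map tree (`:tag`, `:attrs`, `:content`) rather than on a forest, and `subtrees-at`, `process-subtrees-lazy`, `sentence-text`, and `doc` are all hypothetical names.

```clojure
(require '[clojure.string :as str])

(defn subtrees-at
  "Returns a lazy seq of every subtree whose tag path from the root equals `path`."
  [node path]
  (when (and (map? node) (= (:tag node) (first path)))
    (if (empty? (rest path))
      [node]
      (mapcat #(subtrees-at % (rest path)) (:content node)))))

(defn process-subtrees-lazy
  "Applies `handler` to each subtree found at `path`; `map` keeps the result lazy."
  [tree path handler]
  (map handler (subtrees-at tree path)))

(defn sentence-text
  "Handler: joins the text of a sentence node's word children with spaces."
  [sentence-node]
  (->> (:content sentence-node)
       (filter map?)                    ; skip whitespace-only text between elements
       (map #(str/join (:content %)))   ; the text inside each word element
       (str/join \space)))

(def doc
  {:tag :document, :attrs {}, :content
   [{:tag :sentence, :attrs {:id "1"}, :content
     [{:tag :word, :attrs {:id "1.1"}, :content ["foo"]}
      {:tag :word, :attrs {:id "1.2"}, :content ["bar"]}]}
    {:tag :sentence, :attrs {:id "2"}, :content
     [{:tag :word, :attrs {:id "2.1"}, :content ["beyond"]}
      {:tag :word, :attrs {:id "2.2"}, :content ["all"]}
      {:tag :word, :attrs {:id "2.3"}, :content ["recognition"]}]}]})

(process-subtrees-lazy doc [:document :sentence] sentence-text)
;; => ("foo bar" "beyond all recognition")
```

Because the caller only ever sees the handler's return values, each subtree can be discarded once it has been processed, which is what makes the approach attractive for large inputs.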