I'm a complete newbie to Clojure, so please forgive the stupidity below... I'm trying to split a vector of strings on spaces and then collect all the unique strings from the resulting vector of vectors into a single sequence (I'm not picky about the type of sequence). Here's the code I tried:
(require '[clojure.string :as str])
(require '[clojure.set :as set])
(def documents ["this is a cat" "this is a dog" "woof and a meow"])
(apply set/union (map #(str/split % #" ") documents))
I would have expected this to return a set of unique words, i.e.,
#{"woof" "and" "a" "meow" "this" "is" "cat" "dog"}
but instead it returns a vector of non-unique words, i.e.,
["woof" "and" "a" "meow" "this" "is" "a" "cat" "this" "is" "a" "dog"]
Ultimately, I just wrapped that in a set call, i.e.,
(set (apply set/union (map #(str/split % #" ") documents)))
and got what I wanted:
#{"dog" "this" "is" "a" "woof" "and" "meow" "cat"}
but I don't quite understand why the extra set call should be necessary. According to the docs, the union function returns a set. So why'd I get a vector?
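As a sanity check, here's a minimal sketch of what I'm seeing. I'm only assuming that the type of the inputs is what matters: when I convert each split result to a set before the union, I do get a set back, whereas passing the raw vectors gives me a vector.

```clojure
(require '[clojure.string :as str])
(require '[clojure.set :as set])

(def documents ["this is a cat" "this is a dog" "woof and a meow"])

;; Vectors in: str/split returns vectors, and the union of vectors
;; comes back as a vector with duplicates kept.
(def from-vectors
  (apply set/union (map #(str/split % #" ") documents)))

;; Sets in: convert each split result to a set first, then take the union.
(def from-sets
  (apply set/union (map #(set (str/split % #" ")) documents)))

(println (type from-vectors))
(println (type from-sets))
```

So the behavior seems tied to what I pass in rather than to union itself, but I'd still like to understand why.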
Second question: an alternative approach is just
(distinct (apply concat (map #(str/split % #" ") documents)))
which also returns what I want, albeit as a lazy sequence rather than a set. But some of the discussion on this prior SO question suggests that concat is unusually slow, perhaps slower than set operations (?).
Is that right... and is there any other reason to prefer one to the other approach (or some third approach)?
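To make the comparison concrete, here's a sketch of the candidates side by side. The `words` helper and the `via-*` names are just mine for illustration, and the third variant (`into` with a `mapcat` transducer, which I believe needs Clojure 1.7 or later) is only my guess at what a "third approach" might look like, not something I've benchmarked.

```clojure
(require '[clojure.string :as str])
(require '[clojure.set :as set])

(def documents ["this is a cat" "this is a dog" "woof and a meow"])

;; Hypothetical helper: split one document into words on single spaces.
(defn words [doc]
  (str/split doc #" "))

;; Approach 1: union of the split vectors, then wrapped in set.
(def via-union
  (set (apply set/union (map words documents))))

;; Approach 2: concatenate everything, then drop duplicates (lazy seq).
(def via-concat
  (distinct (apply concat (map words documents))))

;; Approach 3 (my guess): pour all words straight into a set
;; using a mapcat transducer.
(def via-into
  (into #{} (mapcat words) documents))

;; All three should agree once compared as sets.
(println (= via-union (set via-concat) via-into))
```

All three give me the same words on this toy input, so the question is really which one holds up on large inputs.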
I don't really care whether I get a vector or a set coming out the other end, but I will ultimately care about performance. I'm trying to learn Clojure by actually producing something that will be useful for my text-mining habit, so this bit of code will eventually be part of a workflow that handles large amounts of text data efficiently... the time for getting it right, performance-wise and just general not-being-stupid-wise, is now.
Thanks!