idiomatic way to use clojure to get set of unique words in vector of strings

Question

I'm a complete newbie to Clojure, so please forgive the stupidity below... but I'm trying to split a vector of strings on spaces, and then get all the unique strings from the whole resulting vector of vectors in a single sequence (where I'm not picky about the type of sequence). Here's the code I tried.

(require '[clojure.string :as str])
(require '[clojure.set :as set])
(def documents ["this is a cat" "this is a dog" "woof and a meow"])
(apply set/union (map #(str/split % #" ") documents))

I would have expected this to return a set of unique words, i.e.,

#{"woof" "and" "a" "meow" "this" "is" "cat" "dog"}

but instead it returns a vector of non-unique words, i.e.,

["woof" "and" "a" "meow" "this" "is" "a" "cat" "this" "is" "a" "dog"]

Ultimately, I just wrapped that in a set call, i.e.,

(set (apply set/union (map #(str/split % #" ") documents)))

and got what I wanted:

#{"dog" "this" "is" "a" "woof" "and" "meow" "cat"}

but I don't quite understand why that should be the case. According to the docs the union function returns a set. So why'd I get a vector?

Second question: an alternative approach is just

(distinct (apply concat (map #(str/split % #" ") documents)))

which also returns what I want, albeit in list form rather than in set form. But some of the discussion on this prior SO suggests that concat is unusually slow, perhaps slower than set operations (?).

Is that right... and is there any other reason to prefer one to the other approach (or some third approach)?

I don't really care whether I get a vector or a set coming out the other end, but I will ultimately care about performance. I'm trying to learn Clojure by actually producing something that will be useful for my text-mining habit, so ultimately this bit of code will be part of a workflow to handle large amounts of text data efficiently... the time for getting it right, performance-wise and just general not-being-stupid-wise, is now.

Thanks!

score 8 · Accepted Answer · answered Mar 01 '16 at 01:19


clojure.set/union operates on sets, but you gave it vectors instead (str/split returns a vector of strings, not a set).

(set (mapcat #(str/split % #" ") documents)) should give you what you need.

mapcat does a lazy "map and concatenate" operation. set then converts that sequence into a set, discarding duplicates as it goes.
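As a minimal sketch of the above, using the `documents` vector from the question: the first form is the `set`/`mapcat` expression as given, and the second is a transducer variant with `into`, which is my own addition rather than part of the original answer. The names `unique-words` and `unique-words-xf` are made up for illustration.

```clojure
(require '[clojure.string :as str])

(def documents ["this is a cat" "this is a dog" "woof and a meow"])

;; mapcat splits each document on spaces and lazily concatenates the
;; resulting vectors; set then keeps one copy of each word.
(def unique-words
  (set (mapcat #(str/split % #" ") documents)))

;; Assumed alternative (not from the answer): the same pipeline as a
;; transducer, pouring the words directly into an empty set.
(def unique-words-xf
  (into #{} (mapcat #(str/split % #" ")) documents))
```

Both forms should produce the same set of eight words. Whether the transducer version is measurably faster on large inputs is something to benchmark on your own data rather than assume.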

Sean Corfield

Thanks. I had assumed the union function would return a set no matter what it gets passed... guess not! – Paul Gowder Mar 01 '16 at 04:03

@PaulGowder It might help to think of the union function having a contract -- the programmer's side of the contract is to pass sets to union, and the function's side of the contract is to return a set. Passing vectors instead of sets broke the contract, so union may or may not fulfill its end of the bargain. It might be less disconcerting if it had reported an error about its input, but over time you will likely see this as less of an issue. – Brian Mar 01 '16 at 14:28

@PaulGowder If you look at the source code, you'll find that `clojure.set/union` `conj`es the elements of the smaller collection into the larger. So, for example, `(clojure.set/union (set (range 10)) (range 3))` works, but `(clojure.set/union (set (range 3)) (range 10))` returns the *sequence* `(2 1 0 0 1 2 3 4 5 6 7 8 9)`. As @Brian implies, you have to regard this behaviour as an accident of implementation, which may change in future. – Thumbnail Mar 01 '16 at 15:58
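If you do want to keep `set/union` in the pipeline, one way to honour the contract described in these comments is to turn each split result into a set before taking the union. This is a sketch under that assumption, not something proposed in the thread; `words-by-union` is a made-up name.

```clojure
(require '[clojure.set :as set]
         '[clojure.string :as str])

(def documents ["this is a cat" "this is a dog" "woof and a meow"])

;; Wrap each split result in set, so union only ever receives sets
;; and its documented behaviour (returning a set) applies.
(def words-by-union
  (apply set/union (map #(set (str/split % #" ")) documents)))

;; Check the type instead of relying on how the result prints.
(println (set? words-by-union))
```

This builds one intermediate set per document, so for plain deduplication the `set`/`mapcat` form in the answer is likely simpler.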
