I am trying to write a function named scan-for which takes a collection of strings (the "tokens") and returns a "tokenizer" function. That function takes a string and returns a (preferably lazy) sequence of strings consisting of the "tokens" contained in the input, recognized in a greedy manner, together with the non-empty substrings between them (and at the start and end), in the order they appear in the input.
For example, ((scan-for ["an" "ban" "banal" "d"]) "ban bananas and banalities")
should produce:
("ban" " " "ban" "an" "as " "an" "d" " " "banal" "ities")
In my first attempt, I use a regex to match the "tokens" (with re-seq) and to find the intervening substrings (with split), and then interleave the resulting sequences. The problem is that the input string is parsed twice with the constructed regex, and that the resulting sequence is not lazy, because of split.
[In the definition of scan-for I use the tacit/point-free style (avoiding lambdas and their sugared disguises), which I find elegant and useful in general (John Backus would probably agree). In Clojure this requires extensive use of partial to take care of uncurried functions. If you don't like it, you can add lambdas, threading macros, etc.]
(defn rpartial
  "a 'right' version of clojure.core/partial"
  [f & args]
  #(apply f (concat %& args)))
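For example, rpartial fixes arguments from the right:

((rpartial clojure.string/split #",") "a,b,c")
;; => ["a" "b" "c"]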
(defn interleave*
  "a 'continuing' version of clojure.core/interleave"
  [& seqs]
  (lazy-seq
   (when-let [seqs (seq (remove empty? seqs))]
     (concat
      (map first seqs)
      (apply interleave* (map rest seqs))))))
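Unlike clojure.core/interleave, this keeps emitting elements after the shorter sequences run out:

(interleave* [1 2 3] [:a])
;; => (1 :a 2 3)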
(defn make-fn
  "makes a function from a symbol and an (optional) arity"
  ([sym arity]
   (let [args (repeatedly arity gensym)]
     (eval (list `fn (vec args) (cons sym args)))))
  ([sym] (make-fn sym 1)))
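This lets a static method, which is not a first-class function in Clojure, be passed around as a value:

((make-fn 'java.util.regex.Pattern/quote) "a.b")
;; => "\\Qa.b\\E"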
(def scan-for
  (comp
   ;; compose the tokenizer: interleave gaps with matches, drop empty strings
   (partial comp
            (partial remove empty?)
            (partial apply interleave*))
   ;; turn the pair [split-fn re-seq-fn] into one fn returning both results
   (partial apply juxt)
   ;; from the pattern, build a fn for the gaps (split) and one for the matches (re-seq)
   (juxt
    (partial rpartial clojure.string/split)
    (partial partial re-seq))
   re-pattern
   ;; alternation of the quoted tokens
   (partial clojure.string/join \|)
   (partial map (make-fn 'java.util.regex.Pattern/quote))
   ;; sort descending, so that longer tokens sharing a prefix are tried first
   (partial sort (comp not neg? compare))))
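For the example tokens, this builds (up to \Q...\E quoting) the pattern #"d|banal|ban|an"; the descending sort is what makes "banal" get tried before its prefix "ban":

(clojure.string/join \| (sort (comp not neg? compare) ["an" "ban" "banal" "d"]))
;; => "d|banal|ban|an"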
In my second attempt, I use a regex to match the "tokens" and the intervening single characters, and then group those single characters back into substrings. Here I don't like the amount of processing done outside of regex matching.
(defn scan-for [tokens]
  (comp
   (partial remove empty?)
   ;; group runs of single non-token characters back into substrings
   (fn group [s]
     (lazy-seq
      (if-let [[sf & sr] s]
        (if (or (get sf 1)                     ; longer than one char => a token
                (some (partial = sf) tokens))  ; or a single-character token
          (list* "" sf (group sr))
          (let [[gf & gr] (group sr)]
            (cons (str sf gf) gr)))
        (cons "" nil))))
   (->> tokens
        (sort (comp not neg? compare))
        (map #(java.util.regex.Pattern/quote %))
        (clojure.string/join \|)
        (#(str % "|(?s)."))                    ; also match any single character
        (re-pattern)
        (partial re-seq))))
So: is there any way to do this with some suitable regex, parsing the input only once and doing minimal processing outside of that parsing?
(A lazy version of split that also returned the regex matches would be helpful... if it existed.)
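For concreteness, here is a rough sketch of the kind of function I mean, hand-rolled with java.util.regex.Matcher in a single lazy pass (the name lazy-split-with-matches is made up):

(defn lazy-split-with-matches
  "Lazily yields the substrings of s between matches of re,
  interleaved with the matches themselves, in one pass."
  [re s]
  (let [m (re-matcher re s)]
    ((fn step [pos]
       (lazy-seq
        ;; the Matcher is stateful, but lazy-seq caching guarantees
        ;; each .find runs exactly once and in order
        (if (.find m)
          (let [start (.start m)
                match (.group m)
                end   (.end m)]
            (concat (when (< pos start) [(subs s pos start)])
                    [match]
                    (step end)))
          (when (< pos (count s))
            [(subs s pos)]))))
     0)))

(lazy-split-with-matches #"d|banal|ban|an" "ban bananas and banalities")
;; => ("ban" " " "ban" "an" "as " "an" "d" " " "banal" "ities")

But this is exactly the kind of manual bookkeeping I was hoping the regex engine could absorb.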