
I am trying to write a function named scan-for which takes a collection of strings (the "tokens") as input and returns a "tokenizer" function. That function, given a string, returns a (preferably lazy) sequence of strings consisting of the "tokens" contained in the input, recognized in a greedy manner, together with the non-empty substrings between them (and at the start and end), in the order they appear in the input.

For example ((scan-for ["an" "ban" "banal" "d"]) "ban bananas and banalities") should produce:

("ban" " " "ban" "an" "as " "an" "d" " " "banal" "ities")

In my first attempt, I use a regex to match the "tokens" (with re-seq) and to find the intervening substrings (with split) and then interleave the resulting sequences. The problem is that the input string is parsed twice with the constructed regex and that the resulting sequence is not lazy, because of split.

[In the definition of scan-for I use the tacit/point-free style (avoiding lambdas and their sugared disguises), which I find elegant and useful in general (John Backus would probably agree). In Clojure this requires extensive use of partial to take care of uncurried functions. If you don't like it, you can add lambdas, threading macros etc.]

(defn rpartial
  "a 'right' version of clojure.core/partial"
  [f & args] #(apply f (concat %& args)))

(defn interleave*
  "a 'continuing' version of clojure.core/interleave"
  [& seqs]
  (lazy-seq
    (when-let [seqs (seq (remove empty? seqs))]
      (concat
        (map first seqs)
        (apply interleave* (map rest seqs))))))
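For illustration, here is the intended behavior of these two helpers (a quick sketch, results worked out by hand):

;; rpartial fixes arguments from the right:
;; ((rpartial - 3) 10) => 7
;;
;; interleave* keeps draining the remaining sequences after the shortest ends:
;; (interleave* [1 2 3] [:a]) => (1 :a 2 3)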

(defn make-fn
  "makes a function from a symbol and an (optional) arity"
  ([sym arity]
   (let [args (repeatedly arity gensym)]
     (eval (list `fn (vec args) (cons sym args)))))
  ([sym] (make-fn sym 1)))
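make-fn is needed because Java static methods such as java.util.regex.Pattern/quote are not first-class Clojure functions and cannot be handed to map directly. For example (illustrative):

;; ((make-fn 'java.util.regex.Pattern/quote) "a.b")
;; => "\\Qa.b\\E"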

;; comp applies right-to-left: read this pipeline bottom-up
(def scan-for
  (comp
    (partial comp
      (partial remove empty?)        ; finally, drop the empty gap strings
      (partial apply interleave*))   ; merge gaps and matches in input order
    (partial apply juxt)             ; build a fn of the input: s -> [gaps matches]
    (juxt
      (partial rpartial clojure.string/split)  ; gaps: split the input on the pattern
      (partial partial re-seq))                ; matches: the recognized tokens
    re-pattern
    (partial clojure.string/join \|)
    (partial map (make-fn 'java.util.regex.Pattern/quote))
    (partial sort (comp not neg? compare)))) ; descending, so longer tokens sharing a prefix come first
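For reference, this version does reproduce the expected output (a quick check, using the example from above):

;; ((scan-for ["an" "ban" "banal" "d"]) "ban bananas and banalities")
;; => ("ban" " " "ban" "an" "as " "an" "d" " " "banal" "ities")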

In my second attempt, I use a regex to match the "tokens" and the intervening single characters, and then group those single characters together. Here I don't like the amount of processing done outside of the regex matching.

(defn scan-for [tokens]
  (comp
    (partial remove empty?)
    (fn group [s]
      ;; glue runs of single non-token characters back together
      (lazy-seq
        (if-let [[sf & sr] s]
          (if (or (get sf 1)                     ; longer than one character, or
                  (some (partial = sf) tokens))  ; a single-character token
            (list* "" sf (group sr))
            (let [[gf & gr] (group sr)]
              (cons (str sf gf) gr)))
          (cons "" nil))))
    (->> tokens
         (sort (comp not neg? compare))   ; descending, so longer tokens win
         (map #(java.util.regex.Pattern/quote %))
         (clojure.string/join \|)
         (#(str % "|(?s)."))              ; fall back to any single character
         (re-pattern)
         (partial re-seq))))
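To see what the group step has to clean up: the constructed regex alone yields a flat stream of tokens and single characters, roughly like this (an illustrative sketch of the intermediate result):

;; (re-seq #"\Qd\E|\Qbanal\E|\Qban\E|\Qan\E|(?s)." "ban bananas and banalities")
;; => ("ban" " " "ban" "an" "a" "s" " " "an" "d" " " "banal" "i" "t" "i" "e" "s")
;; group then concatenates the runs of single non-token characters into "as " and "ities".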

So, is there any way to do this with some suitable regex, so that the input is parsed once and minimal processing is done outside of that parsing?

(A lazy version of split which also returns the regex matches would be helpful ...if it existed.)

peter pun
2 Answers


Here's a quick-and-dirty version which is not lazy, but I think it could be turned into a lazy one by using lazy-seq in a few places, as you did in your version:

(defn scan-for
  ([tokens text unmatched xs]
   (if (empty? text)
     (concat xs [unmatched])
     (let [matching (filter #(clojure.string/starts-with? text %) tokens)]
       (if (empty? matching)
         ;; no token here: move one character into the unmatched buffer
         (recur tokens (subs text 1) (str unmatched (subs text 0 1)) xs)
         (let [matched (first matching)]
           ;; emit the pending unmatched text (if any) and the token
           (recur tokens
                  (subs text (count matched))
                  ""
                  (concat xs
                          (when-not (empty? unmatched) [unmatched])
                          [matched])))))))
  ([tokens text]
   (scan-for tokens text "" [])))


;; (scan-for ["an" "ban" "banal" "d"] "ban bananas and banalities")
;; => ("ban" " " "ban" "an" "as " "an" "d" " " "ban" "alities")

Edit:

This was a very interesting one, so I had to try it. I found that clojure.string/split also takes an optional parameter that limits the number of splits it will produce. Assuming that once the limit is reached the rest of the input is not scanned, you can implement it along the lines of your original attempt:

(defn create-regex [xs]
  (->> xs (interpose "|") (apply str) re-pattern))

(defn split-lazy [s re]
  (when-not (empty? s)
    (let [[part remaining] (clojure.string/split s re 2)]
      (lazy-seq (cons part (split-lazy remaining re))))))
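As a sanity check, split-lazy behaves like a lazy clojure.string/split, only forcing one split per element consumed (illustrative):

;; (take 2 (split-lazy "a,b,c,d" #","))
;; => ("a" "b")
;; the remaining "c,d" is never split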

(defn scan-lazy [xs s]
  (let [re         (create-regex xs)
        no-matches (split-lazy s re)
        matches    (concat (re-seq re s) (repeat nil))]
    (remove empty?
            (interleave no-matches matches))))

(defn scan-for [xs] (partial scan-lazy xs))

;; ((scan-for ["an" "ban" "banal" "d"]) "ban bananas and banalities")
;; => ("ban" " " "ban" "an" "as " "an" "d" " " "ban" "alities")

In the code above I use a trick: matches is padded with nils so that interleave can consume both collections; otherwise it would stop as soon as one of them ended.
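The padding trick in isolation (an illustrative sketch):

;; (interleave ["x" "y" "z"] ["A"])
;; => ("x" "A")                         ; stops at the shorter collection
;; (interleave ["x" "y" "z"] (concat ["A"] (repeat nil)))
;; => ("x" "A" "y" nil "z" nil)         ; the nils are then dropped by (remove empty?)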

You can check it's lazy too:

bananas.core> (def bananas ((scan-for ["an" "ban" "banal" "d"]) "ban bananas and banalities"))
#'bananas.core/bananas
bananas.core> (realized? bananas)
false
bananas.core> bananas
("ban" " " "ban" "an" "as " "an" "d" " " "ban" "alities")
bananas.core> (realized? bananas)
true

Edit 2:

If you sort the tokens by decreasing length, you get the "greedy" version that you expected:

(defn create-regex [xs]
  (->> xs (sort-by count) reverse (interpose "|") (apply str) re-pattern))

;; ((scan-for ["an" "ban" "banal" "d"]) "ban bananas and banalities")
;; => ("ban" " " "ban" "an" "as " "an" "d" " " "banal" "ities")
Denis Fuenzalida
  • Thanks for the try but it misses the point of my question. As you can see in the title, tags and final question in my text, I want most of the processing to be done using a regex (I am curious to know if such "regex wizardry" can be done as my knowledge on the topic is only basic). Another flaw is the lack of laziness that you also mentioned. The use of the `xs` argument to build the result means that some considerable modification must be done to achieve laziness (...but it's doable). – peter pun Jun 13 '19 at 18:20
  • Other easily fixable problems include the following: (a) The tokens are not matched greedily. (`matched` should not be the first item of `matching` but the longest one. As you see in my example, `"banalities"` should be split as `("banal" "ities")`.) (b) If there is nothing unmatched in the end, an empty string is added. (c) The function is not curried. – peter pun Jun 13 '19 at 18:22
  • ...and one mostly stylistic modification to your code: As the 4-argument version of `scan-for` is not supposed to be used externally, I would use only 2 arguments and write `(loop [text text unmatched "" xs []] ... )` omitting the `tokens` at the `recur`. – peter pun Jun 13 '19 at 19:41
  • I wrote an improved version which I think should meet the criteria for building a regex, being lazy and creating a curried/partial function, please check the edit – Denis Fuenzalida Jun 15 '19 at 03:29
  • The lazification of `split` is a really nice improvement! However, this solution still does not meet the criterion that an occurrence of a searched token in the input is recognized only once (as both `split-lazy` and `re-seq` will recognize the same occurrence). – peter pun Jun 15 '19 at 17:03
  • Also note that things like checking if a returned sequence is a `LazySeq`, if it is `realized?` etc. do not test "real laziness". The tests you ran also succeed for my first solution, although it is not "really lazy" (yours is). This has to do with the fact that the thunks used for a `LazySeq`, being closures, can hold values that are computed when the `LazySeq` is created (not when an element is realized). A close look at the implementation details of lazy sequences should make these points clearer... – peter pun Jun 15 '19 at 18:05
  • ...For example, composing any sequence-generating function with the `unchunk` function defined [here](https://stackoverflow.com/a/3409568/11419548) does not magically lazify it, although it then generates a `LazySeq`. For another example, see the edit in [this question](https://stackoverflow.com/q/55881144/11419548). – peter pun Jun 15 '19 at 18:06

I have come up with a solution that looks acceptable. It is based on the use of capturing groups and the *? quantifier, the reluctant/non-greedy version of *.

Here it is:

(defn scan-for [tokens]
  (comp
    (partial remove empty?)
    flatten
    (partial map rest)   ; keep only the two capturing groups of each match
    (->> tokens
         (sort (comp not neg? compare)) ;alternatively, we could sort by decreasing length
         (map #(java.util.regex.Pattern/quote %))
         (clojure.string/join \|)
         (#(str "((?s).*?)(" % "|\\z)")) ; reluctant gap capture, then a token or the end
         (re-pattern)
         (partial re-seq))))
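To see why so little postprocessing is needed: each re-seq match is a vector of the whole match, the (possibly empty) gap captured by ((?s).*?), and the recognized token (or the empty \z match at the very end). For the example input, the first matches look roughly like this (an illustrative sketch):

;; (["ban" "" "ban"] [" ban" " " "ban"] ["an" "" "an"] ["as an" "as " "an"] ...)
;; (map rest) drops the whole-match entries, flatten splices the gap/token
;; pairs into one sequence, and (remove empty?) discards the empty gaps.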

And in the tacit style:

(def scan-for
  (comp
    (partial comp
      (partial remove empty?)
      flatten
      (partial map rest))
    (partial partial re-seq)
    re-pattern
    (partial str "((?s).*?)(")
    (rpartial str "|\\z)")
    (partial clojure.string/join \|)
    (partial map (make-fn 'java.util.regex.Pattern/quote))
    (partial sort (comp not neg? compare))))
peter pun