1

I know that clojure.string has a split function, which returns a vector of the parts of the string, excluding the text matched by the given pattern.

(require '[clojure.string :as str-utils])
(str-utils/split "Yes, hello, this is dog yes hello it is me" #"hello")
;; -> ["Yes, " ", this is dog yes " " it is me"]

However, I'm trying to find a function that instead leaves the token as an element in the returned vector. So it would be like

(split-around "Yes, hello, this is dog yes hello it is me" #"hello")
;; -> ["Yes, " "hello" ", this is dog yes " "hello" " it is me"]

Is there a function that does this in any of the included libraries? Any in external libraries? I've been trying to write it myself but haven't been able to figure it out.

akond
TomLisankie
  • 1
    Isn't the missing word implicit by any two adjacent items in the returned array? What did you try to do? – Shlomi Jun 15 '20 at 03:09
  • Excellent point! That is the best answer. – Alan Thompson Jun 15 '20 at 03:38
  • @Shlomi yes, it is implicit but I need to have the string that was split on in the returned vec. In this case since the regex being split on is just a single word, yeah that works. But say the regex was `\[\[.*?\]\]`. In that case, there's a good chance there'll be things like `[[hello]]` and `[[yes]]` and I need to know what the text that was matched on is and where in the string it was. – TomLisankie Jun 15 '20 at 04:23

4 Answers

6

you can also use the regex lookahead/lookbehind feature for that:

user> (clojure.string/split "Yes, hello, this is dog yes hello it is me" #"(?<=hello)|(?=hello)")
;;=> ["Yes, " "hello" ", this is dog yes " "hello" " it is me"]

You can read it as "split at every zero-width position that is immediately preceded or followed by 'hello'".

Note that it also drops the empty strings that would otherwise appear between adjacent matches and at the start or end of the string:

user> (clojure.string/split "helloYes, hello, this is dog yes hellohello it is mehello" #"(?<=hello)|(?=hello)")
;;=> ["hello"
;;    "Yes, "
;;    "hello"
;;    ", this is dog yes "
;;    "hello"
;;    "hello"
;;    " it is me"
;;    "hello"]

you can wrap it into a function like this, for example:

(defn split-around [source word]
  (let [word (java.util.regex.Pattern/quote word)]
    (->> (format "(?<=%s)|(?=%s)" word word)       
         re-pattern
         (clojure.string/split source))))
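
As a quick check of what the quoting buys you, here is a small usage sketch. The function is the same as above (rewritten slightly and repeated only so the snippet runs on its own); the inputs are made up for illustration.

```clojure
(require '[clojure.string :as str])
(import 'java.util.regex.Pattern)

;; Same behavior as the definition above, repeated so the snippet is self-contained.
(defn split-around [source word]
  (let [word (Pattern/quote word)]
    (str/split source (re-pattern (format "(?<=%s)|(?=%s)" word word)))))

;; "." is a regex metacharacter; quoting makes it match only a literal dot.
(split-around "a.b.c" ".")
;;=> ["a" "." "b" "." "c"]

;; Overlapping occurrences are split at every boundary, which may not be what you want.
(split-around "hahaha" "haha")
;;=> ["ha" "ha" "ha"]
```

The second call shows the overlap caveat: every position preceded or followed by the word becomes a split point, so the word itself can get cut.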
leetwinski
  • 3
    I think this is the best answer – Denis Fuenzalida Jun 15 '20 at 06:26
  • 2
    Your `split-around` will fail if the word to split on contains regex metacharacters. If you want to generalize beyond "hello", use `Pattern/quote` to quote the metacharacters. – amalloy Jun 15 '20 at 08:59
  • @amalloy fair enough – leetwinski Jun 15 '20 at 09:48
  • 1
    This doesn't always produce correct output. E.g. `(split-around "hahaha" "haha")` produces `["ha" "ha" "ha"]` but, based on `split`'s behavior, it should produce `["" "haha" "ha"]` or `["haha" "ha"]`. – peter pun Jun 16 '20 at 01:14
  • Also notice that using both lookahead and lookbehind makes this solution slightly inefficient. For example, most occurrences of the separator-word will be recognized twice. – peter pun Jun 16 '20 at 01:19
3
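(require '[clojure.string :as str])
;; Assumes "~" never occurs in the input; otherwise pick another marker.
(-> "Yes, hello, this is dog yes hello it is me"
    (str/replace #"hello" "~hello~")
    (str/split #"~"))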
akond
0

Example using @Shlomi's solution:

(ns tst.demo.core
  (:use tupelo.core tupelo.test)
  (:require [clojure.string :as str]))

(dotest
  (let [input-str "Yes, hello, this is dog yes hello it is me"
        segments  (mapv str/trim
                    (str/split input-str #"hello"))
        result    (interpose "hello" segments)]
    (is= segments ["Yes," ", this is dog yes" "it is me"])
    (is= result ["Yes," "hello" ", this is dog yes" "hello" "it is me"])))

Update

Might be best to write a custom loop for this use case. Something like:

(ns tst.demo.core
  (:use tupelo.core tupelo.test)
  (:require
    [clojure.string :as str] ))

(defn strseg
  "Will segment a string like '<a><tgt><b><tgt><c>' at each occurrence of `tgt`, producing
   an output vector like [ <a> <tgt> <b> <tgt> <c> ]."
  [tgt source]
  (let [tgt-len  (count tgt)
        segments (loop [result []
                        src    source]
                   (if (empty? src)
                     result
                     (let [i (str/index-of src tgt)]
                       (if (nil? i)
                         (let [result-next (into result [src])
                               src-next    nil]
                           (recur result-next src-next))
                         (let [pre-tgt     (subs src 0 i)
                               result-next (into result [pre-tgt tgt])
                               src-next    (subs src (+ tgt-len i))]
                           (recur result-next src-next))))))
        result   (vec
                   (remove (fn [s] (or (nil? s)
                                     (empty? s)))
                     segments))]
    result))

With unit tests:

(dotest
  (is= (strseg "hello" "Yes, hello, this is dog yes hello it is me")
    ["Yes, " "hello" ", this is dog yes " "hello" " it is me"] )
  (is= (strseg "hello" "hello")
    ["hello"])
  (is= (strseg "hello" "") [])
  (is= (strseg "hello" nil) [])
  (is= (strseg "hello" "hellohello") ["hello" "hello" ])
  (is= (strseg "hello" "abchellodefhelloxyz") ["abc" "hello" "def" "hello" "xyz" ])
  )
Alan Thompson
0

Here is another solution that avoids the problems with repetitive patterns and double recognition present in leetwinski's answer (see my comments there) and also computes the parts as lazily as possible:

(defn partition-str [s sep]
  (->> s
       (re-seq
         (->> sep
              java.util.regex.Pattern/quote ; remove this to treat sep as a regex
              (format "((?s).*?)(?:(%s)|\\z)")
              re-pattern))
       (mapcat rest)
       (take-while some?)
       (remove empty?))) ; remove this to keep empty parts

However, this does not behave correctly (or intuitively) when the separator is, or can match, the empty string.

Another way could be to use both re-seq and split with the same pattern and interleave the resulting sequences as shown in this related question. Unfortunately this way every occurrence of the separator will be recognized twice.
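
A minimal sketch of that interleaving idea, for illustration only. The name split-interleave is hypothetical, and the sketch assumes the regex has no capturing groups (re-seq would then return vectors instead of strings) and never matches the empty string.

```clojure
(require '[clojure.string :as str])

;; Hypothetical helper: scans the input twice, once for parts and once for separators.
(defn split-interleave [s re]
  (let [parts (str/split s re -1)   ; negative limit keeps trailing empty strings
        seps  (re-seq re s)]        ; the matched separators, in order
    ;; There is one more part than separators, so pad seps with "" to keep the
    ;; last part, then drop every empty string.
    (->> (interleave parts (concat seps [""]))
         (remove empty?)
         vec)))

(split-interleave "Yes, hello, this is dog yes hello it is me" #"hello")
;;=> ["Yes, " "hello" ", this is dog yes " "hello" " it is me"]

(split-interleave "a [[hello]] b [[yes]]" #"\[\[.*?\]\]")
;;=> ["a " "[[hello]]" " b " "[[yes]]"]
```

The negative limit matters: without it, split drops trailing empty strings and the parts no longer line up one-to-one with the separators.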

Perhaps a better approach would be to build on a more primitive basis using re-matcher and re-find.

Finally, to answer the original question more directly: AFAIK there is no such function in Clojure's standard library or in any external library. Moreover, I don't know of any simple and completely unproblematic solution to this problem (especially with a regex separator).


UPDATE

Here is the best solution I can think of right now, working on a lower level, lazily and with a regex-separator:

(defn re-partition [re s]
  (let [mr (re-matcher re s)]
    ((fn rec [i]
       (lazy-seq
         (if-let [m (re-find mr)]
           (list* (subs s i (.start mr)) m (rec (.end mr)))
           (list (subs s i)))))
     0)))

(def re-partition+ (comp (partial remove empty?) re-partition))

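Notice that splitting and re-seq can both be (re)defined in terms of re-partition (the second definition shadows clojure.core/re-seq in the current namespace):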

(def re-split (comp (partial take-nth 2) re-partition))

(def re-seq (comp (partial take-nth 2) rest re-partition))
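
For reference, a short usage sketch of re-partition on the inputs discussed in this thread. The definition is repeated unchanged only so the snippet runs on its own; it assumes a regex without capturing groups (with groups, re-find returns a vector for each match rather than a string).

```clojure
;; Same definition as above, repeated so the snippet is self-contained.
(defn re-partition [re s]
  (let [mr (re-matcher re s)]
    ((fn rec [i]
       (lazy-seq
         (if-let [m (re-find mr)]
           (list* (subs s i (.start mr)) m (rec (.end mr)))
           (list (subs s i)))))
     0)))

(re-partition #"hello" "Yes, hello, this is dog yes hello it is me")
;;=> ("Yes, " "hello" ", this is dog yes " "hello" " it is me")

;; A regex separator; note the trailing "" after the final match.
(re-partition #"\[\[.*?\]\]" "a [[hello]] b [[yes]]")
;;=> ("a " "[[hello]]" " b " "[[yes]]" "")

;; Overlapping occurrences are consumed left to right.
(re-partition #"haha" "hahaha")
;;=> ("" "haha" "ha")
```

Non-separator parts always sit at even indices and matches at odd indices, which is why empty strings are kept here; use re-partition+ when they are not wanted.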
peter pun