One-line implementation of split in Haskell

Question

What I want is the following one (which I think should be included in prelude since it is very useful in text processing):

split :: Eq a => [a] -> [a] -> [[a]]

e.g:

split ";;" "hello;;world" = ["hello", "world"]

split from Data.List.Utils isn't in base. I feel there should be a short-and-sweet implementation by composing a few base functions, but I can't figure it out. Am I missing something?

"Write this code for me" isn't really a good SO question. And "Write this code for me with some arbitrary constraint" seems even worse. Note that this function is traditionally called `splitOn` rather than `split`. Maybe Hoogle for some examples and see what's already out there. — K. A. Buhr, Mar 10 '18 at 16:12
([Meta double dupe](https://meta.stackoverflow.com/questions/303701/when-a-users-question-amounts-to-write-this-code-for-me-what-is-the-best-cou)). — user202729, Mar 10 '18 at 16:16
[Related](https://stackoverflow.com/questions/4978578/how-to-split-a-string-in-haskell). — user202729, Mar 10 '18 at 16:18
Everything in Haskell can be reduced to a one-line implementation. — chepner, Mar 10 '18 at 19:26
There's [Data.List.Split.splitOn](https://hackage.haskell.org/package/split/docs/Data-List-Split.html#v:splitOn), which seems to be included when in my Haskell environment. — Mark Seemann, Mar 11 '18 at 10:20
@K. A. Buhr firstly, I post this question to [code golf](https://codegolf.stackexchange.com/questions/157792/one-line-implementation-of-split-in-haskell), and they think this should be asked here, so I post this here... I'm wondering why this question can't be asked?? — luochen1990, Mar 11 '18 at 15:13
@chepner When I say "one-line implementation", I mean a very short and direct implementation, sorry for not explain this very clear. — luochen1990, Mar 11 '18 at 15:19
@user202729 This question is very different since it is not asking for some packages to import, it is asking for some easy and direct implementation. — luochen1990, Mar 11 '18 at 15:21
@K.A.Buhr There are many third-party packages who implemented this function, with different names, maybe `splitOn` is the most used one. giving the type signature and usage sample, I think there is no ambiguity, what do you think? — luochen1990, Mar 11 '18 at 15:24
Because it is not nice to ask someone else to do the whole work for you. I don't know, but such questions are not generally upvoted (with some exceptions) — user202729, Mar 11 '18 at 15:27
I'm not asking someone else to do the work for me, I have found the package, and I also searched the related questions, but I never get a short and direct implementation of this function, I think this is an interesting question, and want to see if there is somebody else know the answer. I think this is much like some 'golf' question, so I firstly post it there, but not welcomed... and then here,, but still not welcomed... — luochen1990, Mar 11 '18 at 15:31
While I do think there is an interesting question in here, and I will write an answer to it once I get some free time, the way it is phrased is a bit of a downvote magnet, unfortunately, for the reasons others have mentioned. I feel you'd have had better luck with something like "`split` from `Data.List.Utils` isn't in *base*. I feel there should be a short-and-sweet implementation by composing a few *base* functions, but I can't figure it out. Am I missing something?" — duplode, Mar 12 '18 at 15:23
@duplode Thanks for your words, I feel better now. I think there are many people in this platform voting by their feeling, without understand of the question itself... Maybe my English is not good enough to embellish my words... — luochen1990, Mar 15 '18 at 13:47
Also, some people vote a question based on the existing votes. ... — user202729, Mar 18 '18 at 13:35
@duplode OP had followed your advice but (my guess is) not many of the downvotes were undone. related: https://meta.stackexchange.com/questions/194108/responsibility-of-close-voters-to-re-examine-edited-question .......... — Will Ness, Sep 21 '18 at 07:21
@WillNess Indeed. The system doesn't encourage folks to revisit their votes, and there is a lot of push back against suggestions along those lines ([I tried that once, too](https://meta.stackoverflow.com/q/336788/2751851)). To be fair, I must admit I hadn't actually noticed luochen1990's edit until now, even though I did read the comment accompanying the edit back when it was posted. — duplode, Sep 21 '18 at 12:54
@duplode all the more evidence that a reminder from the system would go a long way. :) :) — Will Ness, Sep 21 '18 at 13:26

duplode · Answer 1 · 2018-09-21T12:46:29.357

Arguably, the best way to check how feasible a short-and-sweet splitOn (or split, as you and MissingH call it -- here I will stick to the name used by the split and extra packages) is trying to write it [note 1].

(By the way, I will use recursion-schemes functions and concepts in this answer, as I find systematising things a bit helps me think about this kind of problem. Do let me know if anything is unclear.)

The type of splitOn is [note 2]:

splitOn :: Eq a => [a] -> [a] -> [[a]]

One way to write a recursive function that builds one data structure from another, like splitOn does, begins by asking whether to do it by walking the original structure in a bottom-up or a top-down way (for lists, that amounts to right-to-left and left-to-right respectively). A bottom-up walk is more naturally expressed as some kind of fold:

foldr @[] :: (a -> b -> b) -> b -> [a] -> b
cata @[_] :: (ListF a b -> b) -> [a] -> b

(cata, short for catamorphism, is how recursion-schemes expresses a vanilla fold. The ListF a b -> b function, called an algebra in the jargon, specifies what happens in each fold step. data ListF a b = Nil | Cons a b, and so, in the case of lists, the algebra amounts to the two first arguments of foldr rolled into one -- the binary function corresponds to the Cons case, and the seed of the fold, to the Nil one.)

A top-down walk, on the other hand, lends itself to an unfold:

unfoldr :: (b -> Maybe (a, b)) -> b -> [a] -- found in Data.List
ana @[_] :: (b -> ListF a b) -> b -> [a]

(ana, short for anamorphism, is the vanilla unfold in recursion-schemes. The b -> ListF a b function is a coalgebra; it specifies what happens in each unfold step. For a list, the possibilities are either emitting a list element and an updated seed or generating an empty list and terminating the unfold.)

Should splitOn be bottom-up or top-down? To implement it, we need to, at any given position in the list, look ahead in order to check whether the current list segment starts with the delimiter. That being so, it makes sense to reach for a top-down solution i.e. an unfold/anamorphism.

Playing with ways to write splitOn as an unfold shows another thing to consider: you want each individual unfold step to generate a fully-formed list chunk. Not doing so will, at best, lead you to unnecessarily walk the original list twice [note 3]; at worst, catastrophic memory usage and stack overflows on long list chunks await [note 4]. One way to achieve that is through a breakOn function, like the one in Data.List.Extra...

breakOn :: Eq a => [a] -> [a] -> ([a], [a])

... which is like break from the Prelude, except that, instead of applying a predicate to each element, it checks whether the remaining list segment has the first argument as a prefix [note 5].

With breakOn at hand, we can write a proper splitOn implementation -- one that, compiled with optimisations, matches in performance the library ones mentioned at the beginning:

splitOnAtomic :: Eq a => [a] -> [a] -> [[a]]
splitOnAtomic delim
    | null delim = error "splitOnAtomic: empty delimiter"
    | otherwise = apo coalgSplit
    where
    delimLen = length delim
    coalgSplit = \case
        [] -> Cons [] (Left [])
        li ->
            let (ch, xs) = breakOn (delim `isPrefixOf`) li
            in Cons ch (Right (drop delimLen xs))

(apo, short for apomorphism, is an unfold that can be short-circuited. That is done by emitting from an unfold step, rather than the usual updated seed -- signaled by Right -- a final result -- signaled by Left. Short-circuiting is needed here because, in the empty list case, we want neither to produce an empty list by returning Nil -- which would wrongly result in splitOn delim [] = [] -- nor to resort to Cons [] [] -- which would generate an infinite tail of []. This trick corresponds directly to the additional splitOn _ [] = [[]] case added to the Data.List.Extra implementation.)

After a few slight detours, we can now address your actual question. splitOn is tricky to write in a short way because, firstly, the recursion pattern it uses isn't entirely trivial; secondly, a good implementation requires a few details that are inconvenient for golfing; and thirdly, what appears to be the best implementation relies crucially on breakOn, which is not in base.

Notes:

[note 1]: Here are the imports and pragmas needed to run the snippets in this answer:

{-# LANGUAGE LambdaCase #-}
{-# LANGUAGE DeriveFunctor #-}
{-# LANGUAGE DeriveFoldable #-}
{-# LANGUAGE DeriveTraversable #-}
{-# LANGUAGE TypeFamilies #-}
{-# LANGUAGE TemplateHaskell #-}

import Data.Functor.Foldable
import Data.Functor.Foldable.TH
import Data.List
import Data.Maybe

[note 2]: An alternative type might be Eq a => NonEmpty a -> [a] -> NonEmpty [a], if one wants to put precision above all else. I won't bother with that here to avoid unnecessary distractions.

[note 3]: As in this rather neat implementation, which uses two unfolds -- the first one (ana coalgMark) replaces the delimiters with Nothing, so that the second one (apo coalgSplit) can split in a straightforward way:

splitOnMark :: Eq a => [a] -> [a] -> [[a]]
splitOnMark delim
    | null delim = error "splitOnMark: empty delimiter"
    | otherwise = apo coalgSplit . ana coalgMark
    where
    coalgMark = \case
        [] -> Nil
        li@(x:xs) -> case stripPrefix delim li of
            Just ys -> Cons Nothing ys
            Nothing -> Cons (Just x) xs
    coalgSplit = \case
        [] -> Cons [] (Left [])
        mxs ->
            let (mch, mys) = break isNothing mxs
            in Cons (catMaybes mch) (Right (drop 1 mys))

(What apo is and what Left and Right are doing here will be covered a little further in the main body of the answer.)

This implementation has fairly acceptable performance, though with optimisations it is a slower than the one in the main body of the answer by a (modest) constant factor. It might be a little easier to golf this one, though...

[note 4]: As in this single unfold implementation, which uses a coalgebra that calls itself recursively to build each chunk as a (difference) list:

splitOnNaive :: Eq a => [a] -> [a] -> [[a]]
splitOnNaive delim 
    | null delim = error "splitOn: empty delimiter"
    | otherwise = apo coalgSplit . (,) id
    where
    coalgSplit = \case
        (ch, []) -> Cons (ch []) (Left [])
        (ch, li@(x:xs)) -> case stripPrefix delim li of
            Just ys -> Cons (ch []) (Right (id, ys))
            Nothing -> coalg (ch . (x :), xs)

Having to decide at each element whether to grow the current chunk or to start a new one is in itself problematic, as it breaks laziness.

[note 5]: Here is how Data.List.Extra implements breakOn. If we want to achieve that using a recursion-schemes unfold, one good strategy is defining a data structure that encodes exactly what we are trying to build:

data BrokenList a = Broken [a] | Unbroken a (BrokenList a)
    deriving (Eq, Show, Functor, Foldable, Traversable)
makeBaseFunctor ''BrokenList

A BrokenList is just like a list, except that the empty list is replaced by the (non-recursive) Broken constructor, which marks the break point and holds the remainder of the list. Once generated by an unfold, a BrokenList can be easily folded into a pair of lists: the elements in the Unbroken values are consed into one list, and the list in Broken becomes the other one:

breakOn :: ([a] -> Bool) -> [a] -> ([a], [a])
breakOn p = hylo algPair coalgBreak
    where
    coalgBreak = \case
        [] -> BrokenF []
        li@(x:xs)
            | p li -> BrokenF li
            | otherwise -> UnbrokenF x xs
    algPair = \case 
        UnbrokenF x ~(xs, ys) -> (x : xs, ys)
        BrokenF ys -> ([], ys)

(hylo, short for hylomorphism, is simply an ana followed by a cata, i.e. an unfold followed by a fold. hylo, as implemented in recursion-schemes, takes advantage of the fact that the intermediate data structure, created by the unfold and then immediately consumed by the fold, can be fused away, leading to significant performance gains.)

It is worth mentioning that the lazy pattern match in algPair is crucial to preserve laziness. The Data.List.Extra implementation linked to above achieves that by using first from Control.Arrow, which also matches the pair given to it lazily.

[My implementation](https://tio.run/##RcyxCsIwEIDhVzkOh3YRXXPEyVGouwhJm7QNpmnIHejQd48OiuvPzzdbfvgYaw1LXovA2YrdXwILseEcg3TJyCZaI@ob4n1jE/ha/Bhe3Wjkk9V3A25cWXMTfZpkBm6l3Y6ng26sGlrl4Dn74sGqXgsNyum/39fFhgQacglJYAc/EYkQ0PZEg/NE44T1DQ). — user202729, Mar 18 '18 at 13:33
@user202729 Further evidence that you can make it compact, as long as you're willing to go full PCG :) You might want to post it as an actual answer. — duplode, Mar 18 '18 at 13:47

One-line implementation of split in Haskell

1 Answers1

Linked