Arguably, the best way to check how feasible a short-and-sweet splitOn
(or split
, as you and MissingH call it -- here I will stick to the name used by the split and extra packages) is trying to write it [note 1].
(By the way, I will use recursion-schemes functions and concepts in this answer, as I find systematising things a bit helps me think about this kind of problem. Do let me know if anything is unclear.)
The type of splitOn
is [note 2]:
splitOn :: Eq a => [a] -> [a] -> [[a]]
One way to write a recursive function that builds one data structure from another, like splitOn
does, begins by asking whether to do it by walking the original structure in a bottom-up or a top-down way (for lists, that amounts to right-to-left and left-to-right respectively). A bottom-up walk is more naturally expressed as some kind of fold:
foldr @[] :: (a -> b -> b) -> b -> [a] -> b
cata @[_] :: (ListF a b -> b) -> [a] -> b
(cata
, short for catamorphism, is how recursion-schemes expresses a vanilla fold. The ListF a b -> b
function, called an algebra in the jargon, specifies what happens in each fold step. data ListF a b = Nil | Cons a b
, and so, in the case of lists, the algebra amounts to the two first arguments of foldr
rolled into one -- the binary function corresponds to the Cons
case, and the seed of the fold, to the Nil
one.)
A top-down walk, on the other hand, lends itself to an unfold:
unfoldr :: (b -> Maybe (a, b)) -> b -> [a] -- found in Data.List
ana @[_] :: (b -> ListF a b) -> b -> [a]
(ana
, short for anamorphism, is the vanilla unfold in recursion-schemes. The b -> ListF a b
function is a coalgebra; it specifies what happens in each unfold step. For a list, the possibilities are either emitting a list element and an updated seed or generating an empty list and terminating the unfold.)
Should splitOn
be bottom-up or top-down? To implement it, we need to, at any given position in the list, look ahead in order to check whether the current list segment starts with the delimiter. That being so, it makes sense to reach for a top-down solution i.e. an unfold/anamorphism.
Playing with ways to write splitOn
as an unfold shows another thing to consider: you want each individual unfold step to generate a fully-formed list chunk. Not doing so will, at best, lead you to unnecessarily walk the original list twice [note 3]; at worst, catastrophic memory usage and stack overflows on long list chunks await [note 4]. One way to achieve that is through a breakOn
function, like the one in Data.List.Extra
...
breakOn :: Eq a => [a] -> [a] -> ([a], [a])
... which is like break
from the Prelude, except that, instead of applying a predicate to each element, it checks whether the remaining list segment has the first argument as a prefix [note 5].
With breakOn
at hand, we can write a proper splitOn
implementation -- one that, compiled with optimisations, matches in performance the library ones mentioned at the beginning:
splitOnAtomic :: Eq a => [a] -> [a] -> [[a]]
splitOnAtomic delim
| null delim = error "splitOnAtomic: empty delimiter"
| otherwise = apo coalgSplit
where
delimLen = length delim
coalgSplit = \case
[] -> Cons [] (Left [])
li ->
let (ch, xs) = breakOn (delim `isPrefixOf`) li
in Cons ch (Right (drop delimLen xs))
(apo
, short for apomorphism, is an unfold that can be short-circuited. That is done by emitting from an unfold step, rather than the usual updated seed -- signaled by Right
-- a final result -- signaled by Left
. Short-circuiting is needed here because, in the empty list case, we want neither to produce an empty list by returning Nil
-- which would wrongly result in splitOn delim [] = []
-- nor to resort to Cons [] []
-- which would generate an infinite tail of []
. This trick corresponds directly to the additional splitOn _ [] = [[]]
case added to the Data.List.Extra
implementation.)
After a few slight detours, we can now address your actual question. splitOn
is tricky to write in a short way because, firstly, the recursion pattern it uses isn't entirely trivial; secondly, a good implementation requires a few details that are inconvenient for golfing; and thirdly, what appears to be the best implementation relies crucially on breakOn
, which is not in base.
Notes:
[note 1]: Here are the imports and pragmas needed to run the snippets in this answer:
{-# LANGUAGE LambdaCase #-}
{-# LANGUAGE DeriveFunctor #-}
{-# LANGUAGE DeriveFoldable #-}
{-# LANGUAGE DeriveTraversable #-}
{-# LANGUAGE TypeFamilies #-}
{-# LANGUAGE TemplateHaskell #-}
import Data.Functor.Foldable
import Data.Functor.Foldable.TH
import Data.List
import Data.Maybe
[note 2]: An alternative type might be Eq a => NonEmpty a -> [a] -> NonEmpty [a]
, if one wants to put precision above all else. I won't bother with that here to avoid unnecessary distractions.
[note 3]: As in this rather neat implementation, which uses two unfolds -- the first one (ana coalgMark
) replaces the delimiters with Nothing
, so that the second one (apo coalgSplit
) can split in a straightforward way:
splitOnMark :: Eq a => [a] -> [a] -> [[a]]
splitOnMark delim
| null delim = error "splitOnMark: empty delimiter"
| otherwise = apo coalgSplit . ana coalgMark
where
coalgMark = \case
[] -> Nil
li@(x:xs) -> case stripPrefix delim li of
Just ys -> Cons Nothing ys
Nothing -> Cons (Just x) xs
coalgSplit = \case
[] -> Cons [] (Left [])
mxs ->
let (mch, mys) = break isNothing mxs
in Cons (catMaybes mch) (Right (drop 1 mys))
(What apo
is and what Left
and Right
are doing here will be covered a little further in the main body of the answer.)
This implementation has fairly acceptable performance, though with optimisations it is a slower than the one in the main body of the answer by a (modest) constant factor. It might be a little easier to golf this one, though...
[note 4]: As in this single unfold implementation, which uses a coalgebra that calls itself recursively to build each chunk as a (difference) list:
splitOnNaive :: Eq a => [a] -> [a] -> [[a]]
splitOnNaive delim
| null delim = error "splitOn: empty delimiter"
| otherwise = apo coalgSplit . (,) id
where
coalgSplit = \case
(ch, []) -> Cons (ch []) (Left [])
(ch, li@(x:xs)) -> case stripPrefix delim li of
Just ys -> Cons (ch []) (Right (id, ys))
Nothing -> coalg (ch . (x :), xs)
Having to decide at each element whether to grow the current chunk or to start a new one is in itself problematic, as it breaks laziness.
[note 5]: Here is how Data.List.Extra
implements breakOn
. If we want to achieve that using a recursion-schemes unfold, one good strategy is defining a data structure that encodes exactly what we are trying to build:
data BrokenList a = Broken [a] | Unbroken a (BrokenList a)
deriving (Eq, Show, Functor, Foldable, Traversable)
makeBaseFunctor ''BrokenList
A BrokenList
is just like a list, except that the empty list is replaced by the (non-recursive) Broken
constructor, which marks the break point and holds the remainder of the list. Once generated by an unfold, a BrokenList
can be easily folded into a pair of lists: the elements in the Unbroken
values are consed into one list, and the list in Broken
becomes the other one:
breakOn :: ([a] -> Bool) -> [a] -> ([a], [a])
breakOn p = hylo algPair coalgBreak
where
coalgBreak = \case
[] -> BrokenF []
li@(x:xs)
| p li -> BrokenF li
| otherwise -> UnbrokenF x xs
algPair = \case
UnbrokenF x ~(xs, ys) -> (x : xs, ys)
BrokenF ys -> ([], ys)
(hylo
, short for hylomorphism, is simply an ana
followed by a cata
, i.e. an unfold followed by a fold. hylo
, as implemented in recursion-schemes, takes advantage of the fact that the intermediate data structure, created by the unfold and then immediately consumed by the fold, can be fused away, leading to significant performance gains.)
It is worth mentioning that the lazy pattern match in algPair
is crucial to preserve laziness. The Data.List.Extra
implementation linked to above achieves that by using first
from Control.Arrow
, which also matches the pair given to it lazily.