Split string on multiple delimiters of any length in Haskell

Question

I am attempting a Haskell coding challenge where, given a certain string with a prefix indicating which substrings are delimiting markers, a list needs to be built from the input.

I have already solved the problem for multiple single-length delimiters, but I am stuck with the problem where the delimiters can be any length. I use splitOneOf from Data.List.Split, but this works for character (length 1) delimiters only.

For example, given

input ";,\n1;2,3,4;10",

delimiters are ';' and ','

splitting the input on the above delivers

output [1,2,3,4,10]

The problem I'm facing has two parts:

Firstly, a single delimiter of any length, e.g.

"****\n1****2****3****4****10" should result in the list [1,2,3,4,10].

Secondly, more than one delimiter can be specified, e.g.

input "[***][||]\n1***2||3||4***10",

delimiters are "***" and "||"

splitting the input on the above delivers

output [1,2,3,4,10]

My code for retrieving the delimiter in the case of character delimiters:

--This gives the delimiters as a list of characters, i.e. a String.
getDelimiter::String->[Char]
getDelimiter text = head . splitOn "\n" $ text

--drop "[delimiters]\n" from the input
body::String->String
body text = drop ((length . getDelimiter $ text)+1)) $ text 

--returns tuple with fst being the delimiters, snd the body of the input
doc::String->(String,String) 
doc text = (getDelimiter text, body text)

--given the delimiters and the body of the input, return a list of strings
numbers::(String,String)->[String]
numbers (delim, rest) = splitOneOf delim rest

--input ",@#\n1,2@3#4" gives output ["1","2","3","4"]
getList::String->[String]
getList text = numbers . doc $ text

So my question is, how do I do the processing for when the delimiters are e.g. "***" and "||"?

Any hints are welcome, especially in a functional programming context.

That's really not clear to me. Could you show us two or three examples, well formatted, with the expected inputs and outputs ? — Stéphane Laurent, Mar 12 '18 at 06:35
Thanks, that's clearer. Did you try to use successively Data.List.Split ? It is not convenient enough ? — Stéphane Laurent, Mar 12 '18 at 07:52
Just a thought. `splitWhen (`elem` [';',',']) input` gives `["","","1","2","3","4","10"]`. Then you could remove the empty strings then. — Stéphane Laurent, Mar 12 '18 at 07:57
Or something like that **filter (\x -> not $ elem x [";",",",";,"]) $ split (dropBlanks . condense $ whenElt (`elem` [';',','])) input** — Stéphane Laurent, Mar 12 '18 at 08:06
I've already solved this problem for when the delimiting markers are characters, i.e. length of one. The problem is when the delimiter is of unspecified length - this gets especially tricky, because the methods in Data.List.Split assume the delimiters are characters. **splitOn** takes a string argument - what I need is essentially a **splitOn** that takes a list of strings — Johan Jansen van Vuuren, Mar 12 '18 at 08:28
Is there some sort of contest going on? Or is this homework? https://stackoverflow.com/q/49211131/126014 — Mark Seemann, Mar 12 '18 at 16:08
You can use `splitWhen` followed by any condition, this way you're not restricted to lenght one characters. — Stéphane Laurent, Mar 12 '18 at 17:11
@mark-seemann I'm attempting the coding exercise located at [codekata](http://osherove.com/tdd-kata-1/) just for practice. — Johan Jansen van Vuuren, Mar 13 '18 at 06:32

score 3 · Accepted Answer · answered Mar 12 '18 at 16:40

If you don't mind making multiple passes over the input string, you can use splitOn from Data.List.Split, and gradually split the input string using one delimiter at a time.

You can write this fairly succinctly using foldl':

import Data.List
import Data.List.Split

splitOnAnyOf :: Eq a => [[a]] -> [a] -> [[a]]
splitOnAnyOf ds xs = foldl' (\ys d -> ys >>= splitOn d) [xs] ds

Here, the accumulator for the fold operation is a list of strings, or more generally [[a]], so you have to 'lift' xs into a list, using [xs].

Then you fold over the delimiters ds - not the input string to be parsed. For each delimiter d, you split the accumulated list of strings with splitOn, and concatenate them. You could also have used concatMap, but here I arbitrarily chose to use the more general >>= (bind) operator.

This seems to do what is required in the OP:

*Q49228467> splitOnAnyOf [";", ","] "1;2,3,4;10"
["1","2","3","4","10"]
*Q49228467> splitOnAnyOf ["***", "||"] "1***2||3||4***10"
["1","2","3","4","10"]

Since this makes multiple passes over temporary lists, it's most likely not the fastest implementation you can make, but if you don't have too many delimiters, or extremely long lists, this may be good enough.

Thank you! For my use case, this method is fine - my input strings will not be very long, so performance won't be an issue. — Johan Jansen van Vuuren, Mar 13 '18 at 07:18
Prelude> [read n :: Int | n <- ["1","2","3","9","10"]] >>>>>>>>>>>>>>> [1,2,3,9,10] — fp_mora, Mar 13 '18 at 19:38

score 2 · Answer 2 · answered Mar 12 '18 at 14:46

This problem has two kinds of solutions: the simple, and the efficient. I will not cover the efficient (because it is not simple), though I will hint on it.

But first, the part where you extract the delimiter and body parts of the input, may be simplified with Data.List.break:

delims = splitOn "/" . fst . break (== '\n')  -- Presuming the delimiters are delimited with
                                              -- a slash.
body   =               snd . break (== '\n')

In any way, we may reduce this problem to finding the positions of all the given patterns in a given string. (By saying "string", I do not mean the haskell String. Rather, I mean an arbitrarily long sequence (or even an infinite stream) of any symbols for which an Equality relation is defined, which is typed in Haskell as Eq a => [a]. I hope this is not too confusing.) As soon as we have the positions, we may slice the string to our hearts' content. If we want to deal with an infinite stream, we must obtain the positions incrementally, and yield the results as we go, which is a restriction that must be kept in mind. Haskell is equipped well enough to handle the stream case as well as the finite string.

A simple approach is to cast isPrefixOf on the string, for each of the patterns.

If some of them matches, we replace it with a Nothing.
Otherwise we mark the first symbol as Just and move to the next position.

Thus, we will have replaced all the different delimiters by a single one: Nothing. We may then readily slice the string by it.

This is fairly idiomatic, and I will bring the code to your judgement shortly. The problem with this approach is that it is inefficient: in fact, if a pattern failed to match, we would rather advance by more than one symbol.

It would be more efficient to base our work on the research that has been made into finding patterns in a string; this problem is well known and there are great, intricate algorithms that solve it an order of magnitude faster. These algorithms are designed to work with a single pattern, so some work must be put into adapting them to our case; however, I believe they are adaptable. The simplest and eldest of such algorithms is the KMP, and it is already encoded in Haskell. You may wish to take arms and generalize it − a quick path to some amount of fame.

Here is the code:

module SplitSubstr where

-- stackoverflow.com/questions/49228467

import Data.List (unfoldr, isPrefixOf, elemIndex)
import Data.List.Split (splitWhen)  -- Package `split`.
import Data.Maybe (catMaybes, isNothing)

-- | Split a (possibly infinite) string at the occurrences of any of the given delimiters.
--
-- λ take 10 $ splitOnSubstrs ["||", "***"] "la||la***fa"
-- ["la","la","fa"]
--
-- λ take 10 $ splitOnSubstrs ["||", "***"] (cycle "la||la***fa||")
-- ["la","la","fa","la","la","fa","la","la","fa","la"]
--
splitOnSubstrs :: [String] -> String -> [String]
splitOnSubstrs delims
    = fmap catMaybes       -- At this point, there will be only `Just` elements left.
    . splitWhen isNothing  -- Now we may split at nothings.
    . unfoldr f            -- Replace the occurences of delimiters with a `Nothing`.
  where

-- | This is the base case. It will terminate the `unfoldr` process.
    f [ ]  = Nothing

-- | This is the recursive case. It is divided into 2 cases:
-- * One of the delimiters may match. We will then replace it with a Nothing.
-- * Otherwise, we will `Just` return the current element.
--
-- Notice that, if there are several patterns that match at this point, we will use the first one.
-- You may sort the patterns by length to always match the longest or the shortest. If you desire
-- more complicated behaviour, you must plug a more involved logic here. In any way, the index
-- should point to one of the patterns that matched.
--
--                       vvvvvvvvvvvvvv
    f body@(x:xs) = case elemIndex True $ (`isPrefixOf` body) <$> delims of
        Just index -> return (Nothing, drop (length $ delims !! index) body)
        Nothing    -> return (Just x, xs)

It might happen that you will not find this code straightforward. Specifically, the unfoldr part is somewhat dense, so I will add a few words about it.

unfoldr f is an embodiment of a recursion scheme. f is a function that may chip a part from the body: f :: (body -> Maybe (chip, body)).

As long as it keeps chipping, unfoldr keeps applying it to the body. This is called recursive case.
Once it fails (returning Nothing), unfoldr stops and hands you all the chips it thus collected. This is called base case.

In our case, f takes symbols from the string, and fails once the string is empty.

That's it. I hope you send me a postcard when you receive a Turing award for a fast splitting algorithm.

Split string on multiple delimiters of any length in Haskell

2 Answers2