1

I have a list of lines that that have hard breaks at a specific line number, and look like this:

Header:<SmallJson>
Header:<VeryLongJson...
....>
Header:<Json>

And I need to process it so it removes the line breaks, so it becomes like this:

Header:<SmallJson>
Header:<VeryLongJson.......>
Header:<Json>

I've come up with a solution but I'm not particularly happy about it:

let rawLines = [ "Header:<SmallJson>"
                 "Header:<VeryLongJson..."
                 "....>"
                 "Header:<Json>" ]

let processedLines = 
    (([], ""), rawLines)
    ||> List.fold (fun (result, state) line -> 
        if line.StartsWith "Header:"
        then state::result, line
        else result, state + line)
    |> (fun (result, state) -> state::result)
    |> List.rev
    |> List.tail

Is there a way to make this in a simpler way? The extra state::result and List.tail at the end of the fold particularly annoy me. Preferably without using mutation

Gustavo Guerra
  • 5,319
  • 24
  • 42

3 Answers3

3

This is essentially the "chunking" problem, which already has some good answers on SO. I like Brian's approach here, which, translated to your problem, would be:

[ let linesToJoin = ResizeArray()
  for line in rawLines do
    if line.StartsWith("Header:") && linesToJoin.Count > 0 then
      yield String.Join("", linesToJoin)
      linesToJoin.Clear()
    linesToJoin.Add(line) 
  if linesToJoin.Count > 0 then
    yield String.Join("", linesToJoin) ]

It's not more elegant but I think the intent is clearer.

Another option is to use Tomas' groupWhen function (see usage).

Community
  • 1
  • 1
Daniel
  • 47,404
  • 11
  • 101
  • 179
  • That also works, but I was looking for a solution without mutation – Gustavo Guerra Jan 20 '14 at 16:57
  • 1
    Then you will want to stay away from the `Seq` and `List` modules (which use mutation internally). JK (to make a point). I don't see the harm in narrowly scoped mutation. From the outside, this just looks like a plain old immutable list. – Daniel Jan 20 '14 at 16:58
  • I know Seq and List use mutation internally, but I'd like to avoid it myself, as an exercise. Actually, your mutable solution is worse performance-wise than my immutable one, because it's constantly adding and clearing items from the array list. – Gustavo Guerra Jan 20 '14 at 17:11
  • As you probably know, a test can be made for every desired result. `ResizeArray` only allocates (almost certainly the perf hit you're seeing) when its internal buffer is exceeded, which most likely happens a few times at the begining and then never happens again since the buffer is large enough to hold any remaining groups. Which means the perf hit is all up front and will dissipate in relation to the length of the input sequence. – Daniel Jan 20 '14 at 17:15
2

You could do a very similar tail recursive function and avoid those two steps:

let rec combineLines (currentLine:string) combinedLines = function
| (line:string)::tail when line.StartsWith "Header:" && currentLine <> "" -> 
    combineLines line (currentLine::combinedLines) tail
| line::tail -> combineLines (currentLine + line) combinedLines tail
| [] -> currentLine::combinedLines

lines |> combineLines "" [] |> List.rev
Matthew Mcveigh
  • 5,695
  • 22
  • 22
2

It's actually simpler if you process from the end with foldBack, and in particular you don't need to reverse the result so it should be faster:

let processedLines =
    (rawLines, ("", []))
    ||> List.foldBack (fun line (currentLine, allLines) ->
        if line.StartsWith "Header:" then
            "", line + currentLine :: allLines
        else
            line + currentLine, allLines)
    |> function
        | "", lines -> lines
        | _ -> failwith "The original string didn't start with 'Header:'"
Tarmil
  • 11,177
  • 30
  • 35
  • Ah, didn't recall to use foldBack. Was this kind of answer I was looking for. Thanks! – Gustavo Guerra Jan 20 '14 at 19:22
  • 1
    "*in particular you don't need to reverse the result so it should be faster*" : Well, it reverses the input, which is likely to be longer than the result, so "faster" is quite unlikely. – ildjarn Jan 21 '14 at 06:22
  • @ildjarn: More specifically, lists with more than four elements are first copied to an array ([see source on GitHub](https://github.com/fsharp/fsharp/blob/master/src/fsharp/FSharp.Core/list.fs#L190-203)). – Daniel Jan 21 '14 at 17:02
  • @ildjarn: Indeed, I didn't think that through :) You would have only one pass if you started with an array instead of a list though. – Tarmil Jan 21 '14 at 21:40