I have a CSV file with two columns, text and count. The goal is to transform the file from this:

some text once,1
some text twice,2
some text thrice,3

To this:

some text once,1
some text twice,1
some text twice,1
some text thrice,1
some text thrice,1
some text thrice,1

repeating each line count times and spreading the count over that many lines.

This seems to me like a good candidate for Seq.unfold, generating the additional lines, as we read the file. I have the following generator function:

let expandRows (text:string, number:int32) =
    if number = 0 
    then None
    else
        let element = text                  // "element" will be in the generated sequence
        let nextState = (element, number-1) // threaded state replacing looping 
        Some (element, nextState)

FSI yields the following function signature:

val expandRows : text:string * number:int32 -> (string * (string * int32)) option

Executing the following in FSI:

let expandedRows = Seq.unfold expandRows ("some text thrice", 3)

yields the expected:

val it : seq<string> = seq ["some text thrice"; "some text thrice"; "some text thrice"]

The question is: how do I plug this into the context of a larger ETL pipeline? For example:

File.ReadLines(inFile)                  
    |> Seq.map createTupleWithCount
    |> Seq.unfold expandRows // type mismatch here
    |> Seq.iter outFile.WriteLine

The error below is on expandRows in the context of the pipeline.

Type mismatch. 
Expecting a 'seq<string * int32> -> ('a * seq<string * int32>) option'    
but given a     'string * int32 -> (string * (string * int32)) option' 
The type    'seq<string * int32>' does not match the type 'string * int32'

I was expecting expandRows to return a seq of string, as in my isolated test. As that is neither the "Expecting" nor the "given" type, I'm confused. Can someone point me in the right direction?

A gist for the code is here: https://gist.github.com/akucheck/e0ff316e516063e6db224ab116501498

Ringil
akucheck

3 Answers

Seq.map produces a sequence, but Seq.unfold does not take a sequence; it takes a single value (the initial state). So you can't directly pipe the output of Seq.map into Seq.unfold. You need to apply it element by element instead.

But then, for each element your Seq.unfold will produce a sequence, so the ultimate result will be a sequence of sequences. You can collect all those "subsequences" in a single sequence with Seq.collect:

File.ReadLines(inFile) 
    |> Seq.map createTupleWithCount 
    |> Seq.collect (Seq.unfold expandRows)
    |> Seq.iter outFile.WriteLine

Seq.collect takes a function and an input sequence. For every element of the input sequence, the function produces another sequence, and Seq.collect concatenates all those sequences into one. You may think of Seq.collect as Seq.map and Seq.concat combined in one function. Also, if you're coming from C#, Seq.collect is called SelectMany over there.
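
To see the Seq.map plus Seq.concat equivalence without touching any files, here is a minimal sketch that can be pasted into FSI. The `rows` list is a hypothetical in-memory stand-in for the parsed CSV lines, and `expandRows` is repeated from the question so the snippet is self-contained.

```fsharp
// Hypothetical stand-in for the tuples produced by createTupleWithCount.
let rows = [ ("some text once", 1); ("some text twice", 2) ]

// Generator from the question: emit the text and count the state down to zero.
let expandRows (text: string, number: int) =
    if number = 0 then None
    else Some (text, (text, number - 1))

// Map each tuple to a seq<string>, then flatten the seq<seq<string>>.
let viaMapConcat =
    rows
    |> Seq.map (Seq.unfold expandRows)
    |> Seq.concat
    |> List.ofSeq

// Seq.collect performs the mapping and the flattening in one step.
let viaCollect =
    rows
    |> Seq.collect (Seq.unfold expandRows)
    |> List.ofSeq

printfn "%A" viaCollect
```

Both bindings evaluate to ["some text once"; "some text twice"; "some text twice"].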

Fyodor Soikin

In this case, since you simply want to repeat a value a number of times, there's no reason to use Seq.unfold. You can use Seq.replicate instead:

// 'a * int -> seq<'a>
let expandRows (text, number) = Seq.replicate number text

You can use Seq.collect to compose it:

File.ReadLines(inFile)
|> Seq.map createTupleWithCount
|> Seq.collect expandRows
|> Seq.iter outFile.WriteLine

In fact, the only work performed by this version of expandRows is to 'unpack' a tuple and compose its values into curried form.

While F# doesn't come with such a generic function in its core library, you can easily define it (and other similarly useful functions):

module Tuple2 =
    let curry f x y = f (x, y)    
    let uncurry f (x, y) = f x y    
    let swap (x, y) = (y, x)

This would enable you to compose your pipeline from well-known functional building blocks:

File.ReadLines(inFile)
|> Seq.map createTupleWithCount
|> Seq.collect (Tuple2.swap >> Tuple2.uncurry Seq.replicate)
|> Seq.iter outFile.WriteLine
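
As a quick check of that composition outside the file pipeline, the following sketch runs it over a hypothetical in-memory `rows` list (not part of the original code). It assumes F# 4 or later, where Seq.replicate is available, and redefines only the two Tuple2 helpers it needs.

```fsharp
module Tuple2 =
    // Turn a curried two-argument function into one that takes a pair.
    let uncurry f (x, y) = f x y
    // Exchange the two components of a pair.
    let swap (x, y) = (y, x)

// Hypothetical stand-in for the tuples produced by createTupleWithCount.
let rows = [ ("some text twice", 2); ("some text thrice", 3) ]

// swap turns (text, count) into (count, text), which matches the
// argument order of Seq.replicate (count first, then the value).
let expanded =
    rows
    |> Seq.collect (Tuple2.swap >> Tuple2.uncurry Seq.replicate)
    |> List.ofSeq

printfn "%A" expanded
```

Here `expanded` holds "some text twice" two times followed by "some text thrice" three times.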
Mark Seemann
  • I love the idea of simplifying by removing Seq.unfold, but I see no reference to Seq.replicate in the MSDN doc. What am I overlooking? – akucheck Dec 29 '16 at 16:09
  • Interesting. Works in FSI, compiles, fails at runtime on Mono w System.MissingMethodException. Need to dig into this... – akucheck Dec 29 '16 at 16:20
  • @akucheck `Seq.replicate` was added in F# 4: https://github.com/Microsoft/visualfsharp/blob/fsharp4/CHANGELOG.md – Mark Seemann Dec 29 '16 at 16:52
  • Thanks! I recalled seeing that chart but had not saved a link to it - been looking all over for it. Strange that MSDN does not show this. *It seems* this API is not yet avail in Mono. Code (obviously) runs fine on Windows. – akucheck Dec 29 '16 at 17:09
  • Even if `Seq.replicate` is not there, there is still no need for `unfold`. You could just as well do `seq { for _ in 1..number -> text }`. – Fyodor Soikin Dec 29 '16 at 18:23

Sounds like what you want to do is actually

File.ReadLines(inFile)                  
|> Seq.map createTupleWithCount
|> Seq.map (Seq.unfold expandRows) // Map each tuple to a seq<string>
|> Seq.concat // Flatten the seq<seq<string>> to seq<string>
|> Seq.iter outFile.WriteLine

as it seems that you want to convert each tuple with count in your sequence into a seq<string> via Seq.unfold and expandRows. This is done by mapping.

Afterwards, you want to flatten your seq<seq<string>> into one large seq<string>, which is done via Seq.concat.

Ringil