1

I've googled and read, and I'm trying to find a "correct" way to do it, but every question I read on SO seems to have completely different answers.

Here is the gist of my problem. files has the type signature of a seq of a triple (a:string, b:string,c:Int64). Being new to f# I'm still not fluent in expressing type signatures (or for that matter understanding them). a is a filename, b is an internal identifier, and c is a value representing the length (size) of the file. baseconfig is a string from earlier in the code.

ignore(files 
    |> Seq.filter( fun(x,y,z) ->  y = baseconfig)  // used to filter only files we want
    |> Seq.fold( fun f n   -> 
        if( (fun (_,_,z) -> z) n > 50L*1024L*1024L) then
            zipfilex.Add((fun (z:string, _, _) -> z) n)
            printfn("Adding 50mb to zip")
            zipfilex.CommitUpdate()
            zipfilex.BeginUpdate()
            ("","",0L)
        else
            zipfilex.Add((fun (z, _, _) -> z) n)
            ("", "", (fun (_, _, z:Int64) -> z) n + (fun (_, _, z:Int64) -> z) f)
    ) ("","",0L)
    )

What this chunk of code is supposed to do, is iterate through each file in files, add it to a zip archive (but not really, it just goes on a list to be committed later), and when the files exceed 50MB, commit the currently pending files to the zip archive. Adding a file is cheap, committing is expensive, so I try to mitigate the cost by batching it.

So far the code kinda works... Except for the ObjectDisposedException I got when it approached 150MB of committed files. But I'm not sure this is the right way to do such an operation. It feels like I'm using Seq.fold in a unconventional way, but yet, I don't know of a better way to do it.

Bonus question: Is there a better way to snipe values out of tuples? fst and snd only work for 2 valued tuples, and I realize you can define your own functions instead of inline them like I did, but it seems there should be a better way.

Update: My previous attempts at fold, I couldn't understand why I couldn't just use an Int64 as an accumulator. Turns out I was missing some critical parenthesis. Little simpler version below. Also eliminates all the crazy tuple extraction.

ignore(foundoldfiles 
    |> Seq.filter( fun (x,y,z) ->  y = baseconfig) 
    |> Seq.fold( fun (a) (f,g,j)   -> 
        zipfilex.Add( f)
        if( a > 50L*1024L*1024L) then
            printfn("Adding 50mb to zip")
            zipfilex.CommitUpdate()
            zipfilex.BeginUpdate()
            0L
        else
             a + j
    ) 0L
    )

Update 2: I'm going to have to go with an imperative solution, F# is somehow re-entering this block of code, after the zip file is closed in the statement which follows it. Which explains the ObjectDisposedException. No idea how that works or why.

Stanislav Kralin
  • 11,070
  • 4
  • 35
  • 58
hova
  • 2,811
  • 20
  • 19
  • Re: unpacking tuples, see this question: http://stackoverflow.com/questions/600216/accessing-a-specific-member-in-a-f-tuple. Pattern matching is the answer. The way you're doing it is fine, or, you can use a `let` binding. – Daniel Apr 15 '11 at 15:37
  • The use of `Seq.fold` feels odd here. Sometimes a `for` loop is the best solution, especially since you're using mutable objects. It may be best to write the solution as you would in C# (clear and correct), then look for ways to refine it using functional techniques. Mutability "in the small" is okay. – Daniel Apr 15 '11 at 15:40
  • But the mutability is outside the scope of traversing the list in a defined manner. This should be an easy to solve problem with a functional language, but instead I find myself stumped. – hova Apr 15 '11 at 15:49
  • @hova - what zip library are you using? – Stephen Swensen Apr 15 '11 at 16:05
  • @Stephen Swensen ICSharpZipLib – hova Apr 15 '11 at 16:09
  • As with any complex code you should break it into pieces. Take the anonymous function outside of the fold and give it a name along with giving its parameters names. Your code will be a lot more readable. – gradbot Apr 15 '11 at 18:50
  • 1
    @gradbot I don't see how, the function is used only once and is specific to the fold operation. There is no need to refactor it. – hova Apr 15 '11 at 19:30

4 Answers4

4

As an alternative to the "dirty" imperative style, you can extend the Seq module with a general and reusable function for chunking. The function is a bit like fold, but it takes a lambda that returns option<'State>. If it returns None, then a new chunk is started and otherwise the element is added to the previous chunk. Then you can write an elegant solution:

files
|> Seq.filter(fun (x, y, z) ->  y = baseconfig) 
|> Seq.chunkBy(fun (x, y, z) sum -> 
     if sum + z > 50L*1024L*1024L then None
     else Some(sum + z)) 0L
|> Seq.iter(fun files ->
    zipfilex.BeginUpdate()
    for f, _, _ in files do zipfilex.Add(f)
    zipfilex.CommitUpdate())

The implementation of the chunkBy function is a bit longer - it needs to use IEnumerator directly & it can be expressed using recursion:

module Seq = 
  let chunkBy f initst (files:seq<_>) = 
    let en = files.GetEnumerator()
    let rec loop chunk st = seq {
      if not (en.MoveNext()) then
        if chunk <> [] then yield chunk
      else
        match f en.Current st with
        | Some(nst) -> yield! loop (en.Current::chunk) nst
        | None -> 
            yield chunk 
            yield! loop [en.Current] initst }
    loop [] initst
Tomas Petricek
  • 240,744
  • 19
  • 378
  • 553
  • @Tomas: Your solutions always come out of left field...I like it! – Daniel Apr 15 '11 at 16:20
  • Nice function, but it seems buggy. You should add `else` before the `match`. Also, some elements are missing, change last line into `yield! loop [en.Current] initst }`. – Laurent Apr 15 '11 at 16:59
  • It is amazing and somewhat confusing that all the F# questions seem to have at least 4 different types of answers – hova Apr 15 '11 at 19:31
  • It's my first day with functional programming and I'm currently reading Tomas' book (amazing coincidence, I had to go back to the cover page to verify that's him). As I think I understand it currently, you can express a transformation in many different ways - there will be as many different answers as there are different people. – Vladislav Zorov Apr 15 '11 at 20:03
  • @hova: I see that as a strength of F#, unless Python is your thing. :-) – Daniel Apr 15 '11 at 20:08
  • So, +1 for the nice reusable function! – Laurent Apr 15 '11 at 22:44
2

I don't think your problem benefits from the use of fold. It's most useful when building immutable structures. My opinion, in this case, is that it makes what you're trying to do less clear. The imperative solution works nicely:

let mutable a = 0L
for (f, g, j) in foundoldfiles do
    if g = baseconfig then
        zipfilex.Add(f)
        if a > 50L * 1024L * 1024L then
            printfn "Adding 50mb to zip"
            zipfilex.CommitUpdate()
            zipfilex.BeginUpdate()
            a <- 0L
        else
            a <- a + j
Daniel
  • 47,404
  • 11
  • 101
  • 179
  • *Personally*, I do what I can to avoid `mutable`, and I don't think `fold` is too unwieldy in this case. But it's nice to see the alternative :-] – ildjarn Apr 15 '11 at 16:10
  • Also note that your code suffers from the same filesize bug I described in my answer. – ildjarn Apr 15 '11 at 16:12
  • 1
    I used to feel that way, but after spending far too much time deciphering convoluted `fold`s written months prior, imperative is refreshing at times. – Daniel Apr 15 '11 at 16:12
  • @ildjarn: I didn't address the bugs/exception in his code, just the overall approach. Sometimes introducing clarity makes it easier to spot the problems. – Daniel Apr 15 '11 at 16:14
  • @Daniel - agree 100%. mutation and side affects acting on objects captured in a closure passed to a higher-order function is already disturbing. discarding the result of the fold drives home the misuse. – Stephen Swensen Apr 15 '11 at 16:16
  • @Stephen - You summed it up well. :-) – Daniel Apr 15 '11 at 16:17
  • *Personally*, I don't care if I use a `mutable` in a function, as long as the function is pure from the outside. Ignoring the result of a fold look odd to me. – Laurent Apr 15 '11 at 16:20
  • I disagree re: `fold` being a "misuse" -- `fold` threads an accumulator through the sequence for you; whether or not that accumulated value is useful after `fold` is finished is somewhat beside the point, because there isn't a good built-in alternative that threads an accumulator then discards the result. I.e., I think this boils down to personal preference moreso than calling it anti-idiomatic. – ildjarn Apr 15 '11 at 16:21
  • 1
    @ildjarn: You could say `for` threads 0 to N accumulators through a sequence, performing an action on each item, and returning `unit`. :-) Which sounds more fitting in this case? Granted, the differences are slight, but if we're talking about minor differences in suitability... – Daniel Apr 15 '11 at 16:24
  • @Daniel : That's a fair assessment; `for` is just something that's too imperative for my personal tastes. I don't think I have a single piece of F# that uses it, as I always bias towards higher order or recursive functions. I.e., if I were that offended by discarding the result of a `fold`, I'd write a recursive function before I'd use `for`. Anyway, I'll quit spamming the comments now :-] – ildjarn Apr 15 '11 at 16:28
  • For future readers of this "spam," I think it's worth mentioning that closures are at the heart of this issue because they introduce limitations, which, in this case, are arguably unmerited. Okay, I'm done. :-) – Daniel Apr 15 '11 at 16:30
  • 2
    In the case where a > 50L*1024L*1024L, shouldn't you be setting a back to 0L ? – petebu Apr 15 '11 at 20:09
1

Here's my take:

let inline zip a b = a, b

foundoldfiles 
|> Seq.filter (fun (_, internalid, _) -> internalid = baseconfig)
|> zip 0L
||> Seq.fold (fun acc (filename, _, filesize) -> 
    zipfilex.Add filename
    let acc = acc + filesize
    if acc > 50L*1024L*1024L then
        printfn "Adding 50mb to zip"
        zipfilex.CommitUpdate ()
        zipfilex.BeginUpdate ()
        0L
    else acc)
|> ignore

Some notes:

  • The zip helper function makes for a clean a pipeline through the entire function without any overhead, and in more complex scenarios helps with type inferrence since the state gets shifted from the right to the left side of the fold functor (though that doesn't matter or help in this particular case)
  • The use of _ to locally discard elements of the tuple that you don't need makes the code easier to read
  • The approach of pipelining into ignore rather than wrapping the entire expression with extra parenthesis makes the code easier to read
  • Wrapping the arguments of unary functions in parenthesis looks bizarre; you can't use parenthesis for non-unary curried functions, so using them for unary functions is inconsistent. My policy is to reserve parenthesis for constructor calls and tupled-function calls

EDIT: P.S. if( a > 50L*1024L*1024L) then is incorrect logic -- the if needs to take into account the accumulator plus the current filesize. E.g., if the first file was >= 50MB then the if wouldn't trigger.

ildjarn
  • 62,044
  • 9
  • 127
  • 211
  • @hova: It forward pipes the first and second element of a tuple, i.e., they become the next-to-last and last args of the function to which they're piped. – Daniel Apr 15 '11 at 16:08
  • @hova: I asked this very question some time ago: http://stackoverflow.com/q/2867625/162396 – Daniel Apr 15 '11 at 20:02
1

If you're not fond of mutable variables and imperative loops, you could always rewrite this using GOTO a functional loop:

let rec loop acc = function
    | (file, id, size) :: files ->
        if id = baseconfig then
            zipfilex.Add file
            if acc > 50L*1024L*1024L then
                printfn "Adding 50mb to zip"
                zipfilex.CommitUpdate()
                zipfilex.BeginUpdate()
                loop 0L files
            else
                loop (acc + size) files
        else
            loop acc files
    | [] -> ()

loop 0L foundoldfiles

The advantage of this is it explicitly states the three different ways that the inductive case can proceed and how the accumulator is transformed in each case (so you're less likely to get this wrong - witness the bug in Daniel's for loop version).

You could even move the baseconfig check into a when clause:

let rec loop acc = function
    | (file, id, size) :: files when id = baseconfig ->
        zipfilex.Add file
        if acc > 50L*1024L*1024L then
            printfn "Adding 50mb to zip"
            zipfilex.CommitUpdate()
            zipfilex.BeginUpdate()
            loop 0L files
        else
            loop (acc + size) files
    | _ :: files -> loop acc files
    | [] -> ()

loop 0L foundoldfiles
petebu
  • 1,827
  • 12
  • 15
  • Could this possibly run into a stackoverflow exception or is this tail recursive? – hova Apr 15 '11 at 21:10
  • @hova: It's tail-recursive. In fact, a quick look in reflector shows that the compiler turns the body into imperative looping code (using branch instructions and labels). – petebu Apr 15 '11 at 21:24
  • +1. This is the approach I would *personally* take (though maybe using `LazyList` instead of `list`), despite my devil's advocate stance regarding `fold` in the above comments. – ildjarn Apr 16 '11 at 00:21