I parse data from a csv file that looks like this:
X,..,..,Dx,..,..
Y,..,..,Dy,..,..
X,..,..,Dx,..,..
Y,..,..,Dy,..,..
X,..,..,Dx,..,..
Y,..,..,Dy,..,..
Each row becomes an element of an array of a type I defined for use with FileHelpers. This probably isn't relevant, but I'm including it in case someone knows a trick I could apply at this stage of the process using FileHelpers.
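For context, the row type is shaped roughly like this (a sketch only: the field names are invented and the column types are guesses; FileHelpers' DelimitedRecord maps delimited columns to public fields in declaration order):

    open FileHelpers

    // Rough shape only; field names and float types are placeholders.
    [<DelimitedRecord(",")>]
    type DataRow() =
        [<DefaultValue>] val mutable Letter : string   // "X", "Y", ...
        [<DefaultValue>] val mutable Col1   : float
        [<DefaultValue>] val mutable Col2   : float
        [<DefaultValue>] val mutable D      : float    // Dx for "X" rows, Dy for "Y" rows
        [<DefaultValue>] val mutable Col4   : float
        [<DefaultValue>] val mutable Col5   : float

    let engine = FileHelperEngine<DataRow>()
    let rows = engine.ReadFile "data.csv"              // DataRow []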
I'm only interested in the pairs (X, Dx) and (Y, Dy). The data could have more letters than just X and Y, e.g. (X, Dx); (Y, Dy); (Z, Dz); ...
I'll call the number of letters nL.
The goal is to get the averages of Dx, Dy, ... for each group, by processing an array Ds of all the D values, which has SUM(nIterations) * nL elements.
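Concretely, with the rows alternating X, Y as above (nL = 2), Ds is laid out like this:

    // index:  0    1    2    3    4    5   ...
    // value: Dx0  Dy0  Dx1  Dy1  Dx2  Dy2  ...
    // Ds.Length = SUM(nIterations) * nL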
I have a list of numbers of iterations:
let nIterations = [2000; 2000; 2000; 1000; 500; 400; 400; 400; 300; 300]
And for each of these numbers, I will have that many "letter groups". So the rows of interest for nIterations.[0] are rows 0 up to (but not including) nIterations.[0] * nL.
To get the rows of interest for nIterations.[i], I build a list nis of running totals by scanning nIterations:
let nis = List.scan (fun x e -> x + e) 0 nIterations
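For the nIterations above, that gives the cumulative starting offsets (in units of letter groups):

    // nis = [0; 2000; 4000; 6000; 7000; 7500; 7900; 8300; 8700; 9000; 9300]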
Then, to isolate the nIterations.[i] group:
let group = Array.sub Ds (nis.[i]*nL) (nIterations.[i]*nL)
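For example, with nL = 2 and i = 3, that is:

    let group3 = Array.sub Ds (nis.[3] * 2) (nIterations.[3] * 2)
    // = Array.sub Ds 12000 2000: 2000 elements starting at index 12000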
Here's the whole thing:
    // For each group: slice its rows out of Ds, regroup them by letter,
    // sum each letter's column, and divide by the iteration count.
    nIterations |> List.mapi (fun i ni ->
        let igroup = Array.sub Ds (nis.[i] * nL) (ni * nL)
        let groupedbyLetter = chunk nL igroup
        let sums = seq { for idx in 0 .. (nL - 1) do
                            let d = seq { for g in groupedbyLetter do
                                              yield Seq.head (Seq.skip idx g) }
                            yield d |> Seq.sum }
        sums |> Seq.map (fun x -> x / float ni)) |> List.ofSeq
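To make the inner "transposition" concrete, here's what happens for one group with nL = 2 (made-up values):

    // igroup         = [| dx0; dy0; dx1; dy1; dx2; dy2 |]
    // chunk 2 igroup  = seq of [dx0; dy0], [dx1; dy1], [dx2; dy2]
    // idx = 0 picks out the Dx column (dx0, dx1, dx2); idx = 1 the Dy column.
    // Each column is summed and divided by ni: a transpose, then a row average.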
That "chunk" function is one I found on SO:
    // splitAt is a helper that came with it; roughly:
    //     let splitAt n xs = Seq.truncate n xs, Seq.skip n xs
    let rec chunk n xs =
        if Seq.isEmpty xs then Seq.empty
        else
            // Peel off the first n elements, then recurse on the rest (lazily).
            let ys, zs = splitAt n xs
            Seq.append (Seq.singleton ys) (chunk n zs)
I have verified this works and gets me what I want: an nIterations.Length-sized collection of nL-sized collections (one average per letter, per group).
The problem is speed: this only works on small data sets. At the sizes in the example I've given, it gets "hung" at the chunk function.
So my question is: how do I go about improving the speed of this whole process? (and/or) What is the best (or at least a better) way to do that "transposition"?
I figure I could:
- try to rearrange the data as I'm reading it in
- try to index the elements directly (see the sketch after this list)
- try breaking the process into smaller stages or "passes"
- ???
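For the "index directly" idea, here's the kind of thing I have in mind (a sketch only; averagesDirect is just a name I'm using here). It skips chunk entirely by stepping through Ds in strides of nL:

    let averagesDirect (Ds : float []) nL (nIterations : int list) =
        let nis = List.scan (+) 0 nIterations
        nIterations
        |> List.mapi (fun i ni ->
            let start = nis.[i] * nL                 // first element of group i in Ds
            List.init nL (fun idx ->
                // Sum every nL-th element, starting from the idx-th letter slot.
                let mutable sum = 0.0
                for j in 0 .. ni - 1 do
                    sum <- sum + Ds.[start + j * nL + idx]
                sum / float ni))

That makes a single pass per group with no intermediate sequences, so it should avoid whatever chunk is choking on, but maybe there's something more idiomatic?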