
In Repa, I would like to apply a certain d-dimensional linear transform in parallel across the innermost dimension of my array, i.e., on all "column" vectors.

In general, such a transformation can be expressed as a matrix M, and each entry of M*v is just a dot product of the appropriate row of M with v. So I could just use traverse with a function that computes the appropriate dot product. This costs d^2 work.

However, my M is special: it admits a linear-work sequential algorithm. For example, M might be a lower-triangular matrix with 1s throughout the lower triangle. Then M*v is just the vector of partial sums of v (a.k.a. "scan"). These sums can be computed sequentially in an obvious way, but one needs the (i-1)st entry of the result to compute the ith entry efficiently. (I have several such M, all of which can be computed in one way or another in linear sequential time.)
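To make this concrete, here is a small sanity check in plain Haskell (lists, not Repa; `naiveMulM` and `partialSums` are illustrative names of my own, not library functions) showing that for this M the d^2-work dot-product formulation and the linear-work scan agree:

```haskell
-- Multiplication by the lower-triangular all-ones matrix M, two ways.

-- O(d^2) work: the ith output entry is the dot product of row i of M
-- with v, i.e. the sum of the first i entries of v.
naiveMulM :: Num a => [a] -> [a]
naiveMulM v = [sum (take i v) | i <- [1 .. length v]]

-- O(d) work: a sequential left scan, where each entry reuses the
-- previous partial sum.
partialSums :: Num a => [a] -> [a]
partialSums = scanl1 (+)
```

Both give `[1,3,6,10]` on input `[1,2,3,4]`; the point is that the second needs the (i-1)st result to compute the ith cheaply, which is exactly the dependence I don't know how to express in Repa.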

I don't see any obvious way to use traverse (or any other Repa functions) to take advantage of this property of M. Can it be done? It will be quite wasteful to use a d^2-work algorithm (even with abundant parallelism) when there's such a fast linear-work algorithm available.

(I've seen some old SO posts (e.g., here) asking similar questions, but nothing that quite matches my situation.)

UPDATE

By request here is some illustrative code, for the M that computes partial sums (as described above). As I expected, the runtime (work) grows super-linearly in d, the second argument of the extent of the array (ext). This comes from the fact that mulM' only specifies how to compute the ith entry of the output, independent of all other entries. Even though there is a linear-time algorithm in the total size of the array, I don't know how to express it in Repa.

Interestingly, if I remove the line that defines the manifest array' from main, then the runtime scales only linearly in the total size of the array! So when the arrays are delayed "all the way down," fusion/optimization must somehow be extracting the linear-work algorithm, but without any explicit help from me. That's amazing, but also not very useful to me, because in reality, I'll need to call mulM on manifest arrays.

{-# LANGUAGE TypeOperators, ScopedTypeVariables, FlexibleContexts #-}

module Main where

import Data.Array.Repa as R

-- multiplication by M across innermost dimension
mulM arr = traverse arr id mulM'
    where mulM' _ idx@(i' :. i) =
              -- sums the first i+1 entries of the row: O(i) work per
              -- output entry, hence O(d^2) work per length-d "column"
              sumAllS $ extract (Z:.0) (Z:.(i+1)) $ slice arr (i' :. All)

ext = Z :. (1000000::Int) :. (10::Int) -- super-linear runtime in 2nd arg
--ext = Z :. (10::Int) :. (1000000::Int) -- takes forever

array = fromFunction ext (\(Z:.j:.i) -> j+i)

main :: IO ()
main = do
  -- apply mulM to a manifest array
  array' :: Array U DIM2 Int <- computeP $ array
  ans :: Array U DIM2 Int <- computeP $ mulM array'
  print "done"
  • Is this question specifically about how to write `scan` using existing library functions? Since the constructors for all of the array types are exported, you could probably write your own function on the underlying representation (i.e., a `Vector` for `Array U`). – user2407038 Jun 11 '14 at 00:24
  • I wasn't asking about how to write `scan`, but would be fine with doing so if it solved my problem. However, I am skeptical that it would do so, because exploiting the underlying `Vector` representation means losing multidimensionality, index transforms, etc. that I need to use. – Chris Peikert Jun 11 '14 at 00:33
  • Be more specific. Publish your "correct but inefficient" code. – leventov Jun 11 '14 at 08:46
  • @leventov I've updated my post with code. – Chris Peikert Jun 11 '14 at 12:30
  • What about precomputing an array of partial sums? – leventov Jun 11 '14 at 18:53
  • @leventov I've thought of that, and will try it out. However, I think it inefficient in space, because it creates a separate manifest array (or vector) of partial sums for each column of the input. That's linear auxiliary space (possibly it could be optimized away...). – Chris Peikert Jun 11 '14 at 19:02
  • 1
    I am inclined to think that you can't do this with repa. After all it is designed to act in parallel on all the elements. Why are your arrays manifest? Can you delay them somehow and then allow fusion to work its magic? – idontgetoutmuch Jun 12 '14 at 08:26
  • @DominicSteinitz I need manifest arrays at various places, for sharing (otherwise performance is terrible). Delaying a manifest array before feeding it to `mulM` does not help. (My previous deleted comment said that it did help, but I was running the wrong code.) Only when the arrays are "delayed all the way down" do I get linear runtime. – Chris Peikert Jun 12 '14 at 14:14
  • Addendum: delaying a manifest array before feeding it to mulM does help a little (around 2.5x speedup), and I think I see why. But the runtime still is essentially quadratic in the "column" size. – Chris Peikert Jun 12 '14 at 14:21

1 Answer


Here's a possible modified version of your code, using delayed arrays:

  {-# LANGUAGE TypeOperators, FlexibleContexts #-}

  import Data.Array.Repa as R

  -- multiplication by M across innermost dimension (as in the question)
  mulM :: Source r Int => Array r DIM2 Int -> Array D DIM2 Int
  mulM arr = traverse arr id mulM'
    where
      mulM' _ (i' :. i) =
          sumAllS $ extract (Z :. 0) (Z :. (i+1)) $ slice arr (i' :. All)

  ext :: DIM2
  ext = Z :. (1000000 :: Int) :. (10 :: Int)

  array :: Array D DIM2 Int
  array = fromFunction ext (\(Z :. j :. i) -> j + i)

  main :: IO ()
  main = do
    let delayedArray = delay array
        result :: Array U DIM2 Int
        result = computeUnboxedS $ mulM delayedArray
    -- force part of the result so the computation actually runs
    print (result ! (Z :. 0 :. 0))

Note that this is a simplified example; you may need to adapt it and experiment further with Repa's features to fully capture the desired linear-work algorithm for your specific matrix M. Repa's documentation has more detail on its capabilities and how best to express such computations.
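If delaying alone doesn't recover linear work on manifest inputs, another route (suggested in the comments above) is to drop to the underlying flat representation and scan each row directly. Here is a hedged sketch of that idea using plain lists; `rowScans` and its chunking helper are hypothetical names of mine, and in real Repa code you would instead apply `Data.Vector.Unboxed.scanl1` to slices of `toUnboxed arr` and rebuild the array with `fromUnboxed`:

```haskell
-- Linear-work partial sums across the innermost dimension of a
-- row-major, flattened DIM2 array, sketched with plain lists.
-- Total work is O(rows * cols): one sequential scan per row.
rowScans :: Num a => Int -> [a] -> [a]
rowScans cols = concatMap (scanl1 (+)) . chunksOf cols
  where
    chunksOf _ [] = []
    chunksOf n xs = let (h, t) = splitAt n xs in h : chunksOf n t
```

For example, `rowScans 3 [1,2,3,4,5,6]` yields `[1,3,6,4,9,15]`. Each per-row scan is inherently sequential, but the rows are independent, so they could still be computed in parallel (one task per row).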