I created a simple passthrough compute kernel:

kernel void filter(texture2d<float, access::read> inTexture [[texture(0)]],
                         texture2d<float, access::write> outTexture [[texture(1)]],
                         uint2 gridPos [[ thread_position_in_grid ]]) {
  float4 color = inTexture.read(gridPos);
  outTexture.write(color, gridPos);
}

I measure the execution time like this:

[self.timer start];
[commandBuffer commit];
[commandBuffer waitUntilCompleted];
CGFloat ms = [self.timer elapse];

The timer class works like this:

- (void)start {
  self.startMach = mach_absolute_time();
}  

- (CGFloat)elapse {
  uint64_t end = mach_absolute_time();
  uint64_t elapsed = end - self.startMach;
  uint64_t nanosecs = elapsed * self.info.numer / self.info.denom;
  uint64_t millisecs = nanosecs / 1000000;

  return millisecs;
}
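As a side note, `elapse` above divides in integers, so anything below a whole millisecond is dropped before the value is returned as `CGFloat`. A minimal sketch of a conversion that keeps the fractional part, written as plain C so it can sit in the same file; `ticks_to_ms` is a hypothetical helper, and `numer`/`denom` stand in for the `mach_timebase_info` fields:

```c
#include <assert.h>
#include <stdint.h>

/* Convert a mach_absolute_time() tick delta to milliseconds.
   The arithmetic is done in double so sub-millisecond detail survives.
   numer/denom are the timebase ratio (ticks * numer / denom = nanoseconds). */
static double ticks_to_ms(uint64_t ticks, uint32_t numer, uint32_t denom) {
    assert(denom != 0);
    double nanos = (double)ticks * (double)numer / (double)denom;
    return nanos / 1e6; /* nanoseconds -> milliseconds */
}
```

For example, assuming a 125/3 timebase, 300000 ticks come out as 12.5 ms here, where the integer version would report 12.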

Dispatch call:

static const NSUInteger kGroupSize = 16;
- (MTLSize)threadGroupSize {
  return MTLSizeMake(kGroupSize, kGroupSize, 1);
}

- (MTLSize)threadGroupsCount:(MTLSize)threadGroupSize {
  return MTLSizeMake(self.provider.texture.width / kGroupSize,
                 self.provider.texture.height / kGroupSize, 1);
}

[commandEncoder dispatchThreadgroups:threadgroups 
               threadsPerThreadgroup:threadgroupSize];
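The group count above uses truncating division, which is exact for a 512x512 texture because 512 is a multiple of 16. For other sizes the last partial row and column of groups would be skipped. A possible rounding-up variant, sketched in plain C; `groups_for_extent` is a hypothetical helper name, not part of Metal:

```c
#include <assert.h>
#include <stddef.h>

/* Number of threadgroups needed to cover `extent` pixels
   with groups of `group_size` threads, rounding up. */
static size_t groups_for_extent(size_t extent, size_t group_size) {
    assert(group_size > 0);
    return (extent + group_size - 1) / group_size;
}
```

With this, 512 pixels still give 32 groups, while 500 pixels give 32 instead of 31. If the count is rounded up, the kernel would presumably also need to return early for any `gridPos` outside the texture bounds.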

This gives me 13 ms on a 512x512 RGBA image, and the time grows linearly if I perform more passes.

Is this correct? It seems like too much overhead for a real-time application.

Michael Kupchick
  • Doesn't seem right. I'm wondering what `self.timer` is; `NSTimer` is [not really suitable for doing such timing](http://stackoverflow.com/questions/17414344/accuracy-of-nstimer). Of course, it could be a custom class that wraps `CFAbsoluteTimeGetCurrent` or similar. It's probably worthwhile including the details of your call to `dispatchThreadgroups` in your sample code; something could be 'off' in this call. Lastly, if doing multiple passes, I'd probably drop the `waitUntilCompleted` from most passes and only call it on the last pass. – lock Jul 14 '16 at 01:40
  • I drop the `waitUntilCompleted` from all the passes except the last one. I will add the timer and dispatch code. – Michael Kupchick Jul 14 '16 at 06:51
  • Code seems fine. Suggest having a look at the GPU profiler in Instruments to see if it concurs. – lock Jul 14 '16 at 08:45
  • Vague hypothesising here: The amount of time sounds suspiciously close to the time of one display frame (16.67ms) - could it be a scheduling issue with the main UI graphics rendering? Is this on the main CPU thread? Does running it on a non-main thread make a difference? – pmdj Jul 20 '16 at 20:41
  • A compute shader is scheduled differently than a vertex or fragment shader. What is the performance if you put the shader logic into the render cycle? – MoDJ Sep 09 '16 at 01:41

1 Answer

Compute kernels are known to have rather high overhead on A7 processors. One thing to consider, though, is that this is basically the least flattering test you can run: a one-shot threadgroup dispatch might take ~2ms to get scheduled, but scheduling of subsequent dispatches can be up to an order of magnitude faster. Additionally there's little chance for latency hiding here. In practice, a much more complex kernel probably wouldn't take substantially longer to execute, and if you can interleave it with whatever rendering you might be doing, you might find performance to be acceptable.
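To make the scheduling argument concrete, here is a rough back-of-the-envelope model in plain C. It is only a sketch: `scheduling_overhead_ms` is a hypothetical helper, and it assumes the first dispatch pays one fixed cost while every later dispatch pays a smaller fixed cost, which real hardware need not follow.

```c
#include <assert.h>

/* Rough scheduling-overhead model (an assumption, not a measurement):
   the first dispatch costs `first_ms`, each later one costs `subsequent_ms`. */
static double scheduling_overhead_ms(unsigned passes,
                                     double first_ms,
                                     double subsequent_ms) {
    assert(first_ms >= 0.0 && subsequent_ms >= 0.0);
    if (passes == 0) {
        return 0.0;
    }
    return first_ms + (double)(passes - 1) * subsequent_ms;
}
```

Plugging in the ~2 ms figure for the first dispatch and a best-case 0.2 ms (an order of magnitude faster) for later ones, ten passes would come to roughly 3.8 ms of scheduling overhead under this model, rather than 20 ms.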

warrenm