
I am trying to display a sequence of frames using Shady, but I'm running into difficulties. I'm looking at 25 frames covering a 1080×1080-pixel area. The stimulus is grayscale, and I do luminance linearization off-line, so I only need to store a uint8 value for each pixel. The full sequence is thus about 29 MB. I define the stimulus as a 3-D numpy array of shape (1080, 1080, 25), save it to disk using np.save(), and load it using np.load().

    try:
        yy = np.load(fname)
    except IOError:
        print(fname + ' does not exist')
        return
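For reference, the save side of this round trip is a single np.save call. A minimal, self-contained sketch (file name and random contents are placeholders, matching the dimensions described above):

```python
import os
import tempfile

import numpy as np

# Dummy 1080x1080x25 uint8 stimulus: one byte per pixel, 25 pages.
yy = np.random.randint(0, 256, size=(1080, 1080, 25), dtype=np.uint8)

fname = os.path.join(tempfile.gettempdir(), 'stimulus.npy')
np.save(fname, yy)            # roughly 29 MB on disk

yy2 = np.load(fname)          # the ~20 ms load step from the question
assert yy2.dtype == np.uint8 and yy2.shape == (1080, 1080, 25)
```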

This step takes about 20ms. It is my understanding that Shady does not deal with uint8 luminance values, but rather with floats between 0 and 1. I thus convert it into a float array and divide by 255.

    yy = yy.astype(np.float64) / 255.0

This second step takes approximately 260 ms, which is already not great (ideally I need to get the stimulus loaded and ready to be presented in 400 ms). I now create a list of 25 numpy arrays to use as my pages parameter in the Stimulus class:

    pages = []
    for j in range(yy.shape[2]):
        pages.append(np.squeeze(yy[:, :, j]))

This is virtually instantaneous. But at my next step I run into serious timing problems.
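As an aside, the per-page loop can also be written as a single numpy.split call, as noted in the comments. A quick check that the two approaches produce the same pages (using a small stand-in array rather than the full 1080×1080 stimulus):

```python
import numpy as np

yy = np.random.randint(0, 256, size=(8, 8, 25), dtype=np.uint8)  # small stand-in

# Loop version, as in the question.
pages_loop = [np.squeeze(yy[:, :, j]) for j in range(yy.shape[2])]

# One-liner: split along the page axis (each piece keeps a trailing axis of 1).
pages_split = np.split(yy, yy.shape[2], axis=2)

assert len(pages_loop) == len(pages_split) == 25
assert all(np.array_equal(a, b.squeeze(2)) for a, b in zip(pages_loop, pages_split))
```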

    if self.sequence is None:
        self.sequence = self.wind.Stimulus(pages, 'sequence', multipage=True, anchor=Shady.LOCATION.UPPER_LEFT, position=[deltax, deltay], visible=False)
    else:
        self.sequence.LoadPages(pages, visible=False)

Here I either create a Stimulus object, or update its pages attribute if this is not the first sequence I load. Either way, this step takes about 10s, which is about 100 times what I can tolerate in my application.

Is there a way to significantly speed this up? What am I doing wrong? I have a pretty mediocre graphics card on this machine (Radeon Pro WX 4100), and if that is the problem I could upgrade it, but I do not want to go through the hassle if that is not going to fix it.

jez
cq70
  • You don't need to explicitly convert from `uint8` to floating-point—Shady will also accept `uint8` as-is. Also, `pages = numpy.split(yy, yy.shape[2], axis=2)` will replace your loop. My first attempt to replicate this with a 1080x1080x25 floating-point array on a late-2013 Retina MacBook (so, also not a stellar graphics card) took 0.5 s to create the stimulus and 0.4 s to update with `LoadPages`. It's slightly less (0.4 s, 0.3 s) with `uint8` arrays. So something mysterious is going on. – jez Jun 19 '19 at 20:56
  • But I *would* expect the time taken to be of that order-of-magnitude (definitely neither 10s nor 0.1s). Loading a new stimulus array from CPU to GPU is not something that can usually be done from one frame to the next without causing a skip. Anything that needs to be time-critical should be pre-transferred (can you create multiple `Stimulus` instances in advance, and only have one visible at any one time?) – jez Jun 19 '19 at 20:57
  • Even on a Ubuntu *virtual machine* on said 6-year-old MacBook, the transfer commands return in 1–1.5 seconds. Does your system show good timing performance when you run the Shady example scripts? How long does it take to load the animated aliens in the `animated-textures` demo, for example? – jez Jun 19 '19 at 21:12
  • To debug, maybe you should print `len(pages)` and `[page.shape for page in pages]` to the console just before your `LoadPages` call. – jez Jun 19 '19 at 21:22
  • len(pages) is 25 and page.shape is (1080, 1080, 1) for all pages. I'm not trying to load a sequence between frames, I'm trying to load a sequence per trial. The trial consists of ~400ms of inter-trial interval, about 800ms of fixation, and the sequence presentation (25 frames @ 144Hz). Ideally I'd load the sequence in the ITI (but I could also take up a chunk of the fixation period). – cq70 Jun 20 '19 at 12:55
  • On my setup (Linux Mint 19, i.e., Ubuntu 18.04) switching from floats to uint8 made a massive difference, to the point that I'm (almost) OK. Whereas with floats it took between 10 and 14 s to load a sequence, with uint8 it took between 330 and 360 ms. – cq70 Jun 20 '19 at 13:00
  • Glad it works—300–400ms is to be expected I think—but disturbed about the 10-14s. I cannot replicate that with my own Ubuntu 18.04 machine (Lenovo Horizon II, 6 years old, terrible Intel graphics card) or my Ubuntu 18.04 VM on the Mac. Would you mind submitting details at https://bitbucket.org/snapproject/shady-hg/issues and I can continue with troubleshooting suggestions from there. – jez Jun 20 '19 at 13:54

2 Answers


Based on jez's comments, his tests, and my tests, I guess that on some configurations (in my case Linux Mint 19 with Cinnamon and a mediocre AMD video card) loading floats can be much slower than loading uint8. With uint8 the behavior appears to be consistent across configurations, so go with uint8 if you can. Since this will (I assume) disable much of what Shady can do in terms of gamma correction and dynamic-range enhancement, it might be limiting for some.

cq70

Shady can accept uint8 pixel values as-is so you can cut out your code for scaling and type-conversion. Of course, you lose out on Shady's ability to do dynamic range enhancement that way, but it seems like you have your own offline solutions for that kind of thing. If you're going to use uint8 stimuli exclusively, you can save a bit of GPU processing effort by turning off dithering (set the .ditheringDenominator of both the World and the Stimulus to 0 or a negative value).
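A sketch of that uint8-only path follows. Only the numpy handling is shown executably; the Shady-specific calls are left as comments, with the `ditheringDenominator` attribute and `LoadPages` method names taken from the text above:

```python
import numpy as np

# Load (or here, fabricate) the uint8 stimulus; no astype()/division needed.
yy = np.random.randint(0, 256, size=(1080, 1080, 25), dtype=np.uint8)

# Hand the uint8 pages to Shady directly.
pages = np.split(yy, yy.shape[2], axis=2)   # 25 arrays of shape (1080, 1080, 1)

# With a Shady World `w` and Stimulus `stim` already created, dithering
# would then be disabled as described above:
#   w.ditheringDenominator = 0
#   stim.ditheringDenominator = 0
#   stim.LoadPages(pages, visible=False)
assert len(pages) == 25 and pages[0].dtype == np.uint8
```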

It seems like the ridiculous 10-to-15-second delays come from inside the compiled binary "accelerator" component, when transferring the raw texture data from RAM to the graphics card. The problem is apparently (a) specific to transferring floating-point texture data rather than integer data, and (b) specific to the graphics card you have (since you reported the problem went away on the same system when you swapped in an NVidia card). Possibly it's also OS- or driver-specific with regard to the old graphics card.

Note that you can also reduce your LoadPages() time from 300–400 ms down to about 40 ms by cutting down the number of numpy operations Shady has to do. Save your arrays as [pages x rows x columns] instead of [rows x columns x pages]. Relative to your existing workflow, this means you do yy = yy.transpose([2, 0, 1]) before saving. Then, when you load, don't transpose back: just split on axis=0, and then squeeze the leftmost dimension out of each resulting page:

    pages = [page.squeeze(0) for page in numpy.split(yy, yy.shape[0], axis=0)]

That way you'll end up with 25 views into the original array, each of which is a contiguous memory block. By contrast, if you do it the original [rows x columns x pages] way, then regardless of whether you do split-and-squeeze or your original slice-and-squeeze loop, you get 25 non-contiguous views into the original memory, and that fact will catch up with you sooner or later—if not when you or Shady convert between numeric formats, then at latest when Shady uses numpy's .tostring method to serialize the data for transfer.
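Putting the reordering together, the whole save/load round trip might look like this (array contents and the file path are placeholders; the point is that the split pages come out as contiguous views):

```python
import os
import tempfile

import numpy as np

# Save side: reorder to [pages x rows x columns] once, before writing.
yy = np.random.randint(0, 256, size=(1080, 1080, 25), dtype=np.uint8)
fname = os.path.join(tempfile.gettempdir(), 'sequence.npy')
np.save(fname, np.ascontiguousarray(yy.transpose([2, 0, 1])))

# Load side: split on axis 0 and squeeze; each page is a contiguous view.
zz = np.load(fname)
pages = [page.squeeze(0) for page in np.split(zz, zz.shape[0], axis=0)]

assert all(p.flags['C_CONTIGUOUS'] for p in pages)
assert pages[0].shape == (1080, 1080)
```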

jez