
I have 0-1 valued vectors that I need to do some matrix operations on. They are not very sparse (only half of the values are 0), but storing them as logical instead of double cuts the memory to an eighth: 1 byte for a logical versus 8 for a double-precision float.

Would it be any slower to do matrix multiplications of a logical vector and a double matrix than to use both as double? See my preliminary results below:

>> x = [0 1 0 1 0 1 0 1]; A = rand(numel(x)); xl = logical(x);
>> tic; for k = 1:10000; x * A * x'; end; toc
Elapsed time is 0.017682 seconds.
>> tic; for k = 1:10000; xl * A * xl'; end; toc
Elapsed time is 0.026810 seconds.
>> xs = sparse(x);
>> tic; for k = 1:10000; xs * A * xs'; end; toc
Elapsed time is 0.039566 seconds.

It seems that the logical representation is noticeably slower (and sparse slower still). Can someone explain why? Is it type-casting time? Is it a limitation of the CPU/FPU instruction set?

EDIT: My system is MATLAB R2012b on Mac OS X 10.8.3, Intel Core i7 3.4 GHz.

EDIT2: A few comments indicate that this is only a problem on Mac OS X. I would like to compile results from diverse architectures and OSes if possible.

EDIT3: My actual problem requires computation with a huge portion of all possible binary vectors of length m, where m can be large enough that the 8 * m * 2^m bytes needed for double storage do not fit in memory.
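
For scale, a quick back-of-the-envelope check (the value of m below is purely hypothetical):

>> m = 30;    % hypothetical vector length
>> fprintf('double: %g GiB, logical: %g GiB\n', 8*m*2^m/2^30, 1*m*2^m/2^30);
double: 240 GiB, logical: 30 GiB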

– Memming
  • I think this has to do with the fact that MATLAB implicitly converts the `logical` to `double` before multiplying by `A`. – Eitan T May 13 '13 at 17:01
  • Timings are different here: `0.161891 / 0.057331 / 0.049061` seconds. – Rody Oldenhuis May 13 '13 at 18:03
  • R2013a Vista32 Intel Core Duo T9300: `0.216960 / 0.026960 / 0.081925` seconds. – Oleg May 13 '13 at 18:27
  • I can confirm your timings with a very similar system (`0.018691 / 0.025646 / 0.036361`): MATLAB R2012b, Mac OS X 10.8.3, Intel Core i7 (Retina MBP). I've seen this exact issue many times and have long suspected that MATLAB wasn't doing a good job optimizing logical calculations on OS X. The naïve approach is to recast the data to the higher class, double, MATLAB's native format, but there are obviously smarter schemes. In some cases the function `double` can be used to manually recast part or all of the logical data before a calculation; in your simple examples above this won't help. – horchler May 19 '13 at 00:21
  • You may also want to file a [bug report / service request](http://mathworks.com/support/service_requests/contact_support.do) with The MathWorks and cite this page. – horchler May 19 '13 at 00:21

3 Answers


I'll start by posting a slightly better benchmark. I'm using the TIMEIT function from Steve Eddins to get more accurate timings:

function [t,err] = test_mat_mult()
    %# data
    N = 4000; sparsity = 0.7;    %# adjust size and sparsity of data
    x = double(rand(1,N) > sparsity);
    xl = logical(x);
    xs = sparse(x);
    A = randn(N);

    %# functions
    f = cell(3,1);
    f{1} = @() mult_func(x,A);
    f{2} = @() mult_func(xl,A);
    f{3} = @() mult_func(xs,A);

    %# timeit
    t = cellfun(@timeit, f);

    %# check results
    v = cellfun(@feval, f, 'UniformOutput',true);
    err = max(abs(v-mean(v)));  %# maximum error
end

function v = mult_func(x,A)
    v = x * A * x';
end

Here are the results on my machine (WinXP 32-bit, R2013a) with N=4000 and sparsity=0.7:

>> [t,err] = test_mat_mult
t =
     0.031212    %# double
     0.031970    %# logical
     0.071998    %# sparse
err =
   7.9581e-13

You can see that double is only slightly faster than logical on average, while sparse is slower than both, as expected (its focus is efficient memory usage, not raw speed).


Now note that MATLAB relies on BLAS implementations optimized for your platform to perform full-matrix multiplication (think DGEMM). These libraries provide routines for single/double types but not for booleans, so a conversion to double must occur first, which would explain why the logical case is slower.
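
If the implicit cast is indeed the overhead, converting the logical vector to double once, outside the timed loop, should recover the double timing. A minimal sketch using the variables from the question:

%# hypothetical check: cast once up front so that only the
%# BLAS call remains inside the timed loop
xd = double(xl);
tic; for k = 1:10000; xd * A * xd'; end; toc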

On Intel processors, the BLAS/LAPACK routines are provided by the Intel MKL library. I'm not sure about AMD processors, but I believe the equivalent ACML is used there:

>> internal.matlab.language.versionPlugins.blas
ans =
Intel(R) Math Kernel Library Version 10.3.11 Product Build 20120606 for 32-bit applications
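
For reference, on recent releases I believe the documented way to query the same information is via `version` (flag support assumed; check your release):

>> version('-blas')     %# BLAS build info
>> version('-lapack')   %# LAPACK build info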

Of course, the sparse case is a different story. (I believe MATLAB uses the SuiteSparse package for many of its sparse operations, but I'm not certain.)

– Amro
  • Here is a very interesting question that you might also like: [Naive C++ Matrix Multiplication 100 times slower than BLAS?](http://codereview.stackexchange.com/questions/20980/naive-c-matrix-multiplication-100-times-slower-than-blas) – Amro May 23 '13 at 21:36

I think the results are reasonably explained by the differences between the representations.

A non-sparse double array is simple and efficient for representing a small body of data that fits very easily in cache.

A logical array is more space-efficient, using one byte per element instead of eight, but that gains you nothing when you only have 8 elements. On the other hand, it has to be converted to double before it can participate in double arithmetic, which adds a step.

A sparse array uses a more complicated representation, designed to save space when most of the array is zero. More operations are required either to decide that the element at a given index is zero or to obtain its non-zero value. Using it for a 50% non-zero array that fits easily in even the smallest caches is a misuse. It is at its best for reducing the memory and data-transfer cost of a large array that is almost all zero. See Sparse vs Normal Array Matlab.
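
You can see the per-representation storage cost directly with `whos`; a quick sketch (exact byte counts vary by release and platform):

x  = double(rand(1,8) > 0.5);  % 8 doubles: 64 bytes
xl = logical(x);               % 8 logicals: 8 bytes
xs = sparse(x);                % index arrays plus values: more than 64 bytes at this size
whos x xl xs                   % compare the Bytes column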

If you are really dealing with 8-element arrays, you should stick with non-sparse arrays of double. If your real work involves larger arrays, you need to benchmark with similar sizes. You also need to make sure the sparseness of your test data matches that of the real data.

– Patricia Shanahan
  • Thank you for the answer. My problem is not super sparse (slightly over 50% of zeros) and pretty big. I'm considering a large collection of binary vectors of length greater than 100. It usually takes a few minutes on my computer, but I want to make sure that it's not too slow for everybody. – Memming May 19 '13 at 15:19
  • I think your next step should be a more realistic benchmark. Use realistic sizes, the number of vectors as well as the vector length. Your current test measures repeated access to very short vectors, which may underestimate the memory footprint impact of using double. – Patricia Shanahan May 19 '13 at 15:22

When you are working with data that fits entirely in cache and isn't too sparse (as in your benchmark), doing extra work (like converting between a logical type and double, or using sparse storage schemes) to try to reduce memory footprint will only slow your code down (as you have noticed).

Data accesses from the L1 cache are fast enough to be "effectively free" when a sufficient amount of computational work is done per data element loaded (as is the case in your example). When this happens, execution speed is limited by computation, not by load/store traffic; by using logical variables, you are doing more computation, which slows down your benchmark.

How big is the working set in the problem you actually want to solve? If it's not at least bigger than the L2 cache on your processor, you should just use normal double matrices. The exact threshold at which logical variables become advantageous is likely considerably larger, and would require some experimentation to determine. (It will also depend on exactly how MATLAB handles the conversion; you want the conversion to happen as part of the tiling for the multiplications. If MATLAB doesn't do that, it will likely never be faster than using double, no matter how big the data set is.)
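
One way to run that experiment is to sweep the problem size, using the same TIMEIT function as in Amro's answer. A rough sketch (the sizes are arbitrary, and the crossover point, if one exists, will depend on your machine):

% Sweep sizes to look for a point where the smaller logical footprint
% starts to pay off (per the caveat above, it may never).
for N = [1000 4000 8000]
    x  = double(rand(1,N) > 0.5);
    xl = logical(x);
    A  = randn(N);
    td = timeit(@() x  * A * x');
    tl = timeit(@() xl * A * xl');
    fprintf('N = %5d: double %.4f s, logical %.4f s\n', N, td, tl);
end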

– Stephen Canon
  • Thank you for the answer. My problem doesn't fit in the L2 cache in all cases. It is supposed to be a generic toolbox for fitting probabilistic models, so I wouldn't know in advance how big the problem is, or how big the user's L1/2 cache will be. From the other comments, it seems that it is highly OS or architecture dependent which is another problem for this kind of optimization. – Memming May 19 '13 at 15:16