
I am beginning to learn Julia after using Matlab for several years. I started by implementing a simple polynomial multiplication (without FFT) to try and understand the role of type stability. A big part of this project is the requirement for a fast polynomial multiplier. However, I have the following timings which I can't understand at all.

function cauchyproduct(L::Array{Float64},R::Array{Float64})
    # direct (non-FFT) Cauchy product of the coefficient arrays L and R
    N = length(L)
    prodterm = zeros(1,2N-1)
    for n=1:N
        Lterm = view(L,1:n)
        Rterm = view(R,n:-1:1)
        prodterm[n] = dot(Lterm,Rterm)
    end

    for n = 1:N-1
        Lterm = view(L,n+1:N)
        Rterm = view(R,N:-1:n+1)
        prodterm[N+n] = dot(Lterm,Rterm)
    end
    prodterm
end



testLength = 10000
goodL = rand(1,testLength)
goodR = rand(1,testLength)
for j in 1:10
    @time cauchyproduct(goodL,goodR)
end
@which cauchyproduct(goodL,goodR)

I get the following timings from 2 sequential runs of this code. The timings from one run to another are completely erratic: the time per call can range anywhere from 0.05s to 2s. Typically, the timings within a single run of the for loop are similar to each other (as in the example below), but even this isn't always the case. Occasionally they alternate, e.g. 0.05s, 0.05s, 1.9s, 0.04s, 0.05s, 2.1s, and so on.

Any idea why this is happening?

  0.544795 seconds (131.08 k allocations: 5.812 MiB)
  0.510395 seconds (120.00 k allocations: 5.340 MiB)
  0.528362 seconds (120.00 k allocations: 5.340 MiB, 0.94% gc time)
  0.507156 seconds (120.00 k allocations: 5.340 MiB)
  0.507566 seconds (120.00 k allocations: 5.340 MiB)
  0.507932 seconds (120.00 k allocations: 5.340 MiB)
  0.527383 seconds (120.00 k allocations: 5.340 MiB)
  0.513301 seconds (120.00 k allocations: 5.340 MiB, 0.83% gc time)
  0.509347 seconds (120.00 k allocations: 5.340 MiB)
  0.509177 seconds (120.00 k allocations: 5.340 MiB)
  0.052247 seconds (120.00 k allocations: 5.340 MiB, 7.95% gc time)
  0.049644 seconds (120.00 k allocations: 5.340 MiB)
  0.047275 seconds (120.00 k allocations: 5.340 MiB)
  0.049163 seconds (120.00 k allocations: 5.340 MiB)
  0.049029 seconds (120.00 k allocations: 5.340 MiB)
  0.054050 seconds (120.00 k allocations: 5.340 MiB, 8.36% gc time)
  0.047010 seconds (120.00 k allocations: 5.340 MiB)
  0.051240 seconds (120.00 k allocations: 5.340 MiB)
  0.050961 seconds (120.00 k allocations: 5.340 MiB)
  0.049841 seconds (120.00 k allocations: 5.340 MiB, 4.90% gc time)

Edit: The timings shown are obtained by executing the code beneath the defined functions twice in a row. Specifically, the code block

goodL = rand(1,testLength)
goodR = rand(1,testLength)
for j in 1:10
    @time cauchyproduct(goodL,goodR)
end

gives vastly different timings on different runs (without recompiling the function above it). In all of the timings, the same method of cauchyproduct is being called. Hopefully this clarifies the problem.

Edit 2: I changed the code block at the end to the following

testLength = 10000
goodL = rand(1,testLength)
goodR = rand(1,testLength)
for j = 1:3
    @time cauchyproduct(goodL,goodR)
end


for j = 1:3
    goodL = rand(1,testLength)
    goodR = rand(1,testLength)
    @time cauchyproduct(goodL,goodR)
end

@time cauchyproduct(goodL,goodR)
@time cauchyproduct(goodL,goodR)
@time cauchyproduct(goodL,goodR)

and got the following timings on 2 repeated executions of the new block.

Timing 1:

  0.045936 seconds (120.00 k allocations: 5.340 MiB)
  0.045740 seconds (120.00 k allocations: 5.340 MiB)
  0.045768 seconds (120.00 k allocations: 5.340 MiB)
  1.549157 seconds (120.00 k allocations: 5.340 MiB, 0.14% gc time)
  0.046797 seconds (120.00 k allocations: 5.340 MiB)
  0.046637 seconds (120.00 k allocations: 5.340 MiB)
  0.047143 seconds (120.00 k allocations: 5.341 MiB)
  0.049088 seconds (120.00 k allocations: 5.341 MiB, 3.88% gc time)
  0.049246 seconds (120.00 k allocations: 5.341 MiB)

Timing 2:

  2.250852 seconds (120.00 k allocations: 5.340 MiB)
  2.370882 seconds (120.00 k allocations: 5.340 MiB)
  2.247676 seconds (120.00 k allocations: 5.340 MiB, 0.14% gc time)
  1.550661 seconds (120.00 k allocations: 5.340 MiB)
  0.047258 seconds (120.00 k allocations: 5.340 MiB)
  0.047169 seconds (120.00 k allocations: 5.340 MiB)
  0.048625 seconds (120.00 k allocations: 5.341 MiB, 4.02% gc time)
  0.045489 seconds (120.00 k allocations: 5.341 MiB)
  0.049457 seconds (120.00 k allocations: 5.341 MiB)

So confused.

SDK
  • I have edited the original post to try and clarify the situation. – SDK Nov 14 '17 at 03:28
  • By a "run" I mean when I run the lower block with the for loop. Each of the 10 timings inside the for loop "usually" has similar timings. But rerunning that block of code (i.e. doing 10 additional timings) gives times which are all over the place. The block shown was running that for loop twice (total of 20 timings). – SDK Nov 14 '17 at 03:47
  • It is per run. Each group of 10 timings has similar values for each timing. However, running the block again gives 10 new timings with vastly different values (as you noticed with the two blocks of 10 above). – SDK Nov 14 '17 at 04:08
  • I'll admit the way the question is asked is a bit confusing, but I've no idea why you've been downvoted to -2. Overly harsh. I can duplicate these results on my machine, and the question is reasonable. It can be frustrating when downvoters don't explain their actions in the comments. I'm upvoting to try and get this back to 0 at least. – Colin T Bowers Nov 14 '17 at 04:17
  • Incidentally, your "good" and "bad" functions will compile down to the same code if the input can be guaranteed to always be `Array{Float64,2}`, which it can in your current tests, so you will see no performance difference between the two. See [here](https://stackoverflow.com/questions/25009072/how-to-write-good-julia-code-when-dealing-with-multiple-types-and-arrays-mult) for more detail. – Colin T Bowers Nov 14 '17 at 04:19
  • Given that you use random numbers, that'll be the (initial) cause of your different timings (as they are per block). Perhaps that shouldn't cause differences, but I'd start there. You may get infinity somewhere, or some other "inconvenient" numbers you use in your calculations (inconvenient for Julia, LLVM, the CPU, whatever). –  Nov 14 '17 at 04:28
  • I recommend removing the function you don't use, since that's only confusing. Just keep one function in your question, the one used for your timings. –  Nov 14 '17 at 04:29
  • You should also comment out parts of your function, e.g. whole loop(s), one or two lines inside a singe loop (and remove the other loop), to see which line in particular causes the variations in timings, and thus, which line(s) in particular depend on the randomness of your input. –  Nov 14 '17 at 04:30
  • Thanks for the helpful replies. I will remove the second method definition since as you say, it has no bearing on the current issue. I have been continuing to test this and it seems to only get weirder. I'll update the question with what I'm talking about. So far it seems unlikely I'll be able to use julia for this project which really bums me out. – SDK Nov 14 '17 at 04:36
  • Answer incoming. I think I can offer some insight. – Colin T Bowers Nov 14 '17 at 04:37
  • Your new timings are still consistent with the random input: the first 3 lines have identical input, the next has different input, so does the next, and then 4 lines with identical input again. So your timings would roughly be like `a, a, a, b, c, d, d, d, d`. The fact that c is (nearly) equal to d is a happy coincidence, and it's probably because d (in both your timing examples) is the best you can get. It'll be worth running your examples a few more times, to see whether the c timing is always nearly equal to d. –  Nov 14 '17 at 04:47
  • Also, I again suggest playing around with commenting out parts of the code, to see which line(s) precisely depend so much on the randomness of the input data (as this seems to be the most likely cause). –  Nov 14 '17 at 04:49
  • @Evert I'm fairly certain the issue is something to do with garbage collection relating to the fact that the code in the question is taking views into a 1xN matrix which is an odd thing to do given that Julia uses column-major ordering. See my answer for more detail. – Colin T Bowers Nov 14 '17 at 05:08
  • @ColinTBowers While unfortunate garbage collections seem to be a likely culprit, it makes me wonder why the results are consistent per set of equal input (whether batches of 3, or of 10), and timings appear to change depending on variation of the input. While your code shows this doesn't have to be the case (all for the better), it just makes me curious about the underlying Julia (or LLVM?) machinery that causes this. –  Nov 14 '17 at 05:27
  • @Evert Yes, I'm not sure. I should probably add that I couldn't duplicate those outcomes on my machine. I got the occassional spike (presumably from garbage collection), but was unable to get any of the batches that OP got. – Colin T Bowers Nov 14 '17 at 06:05

1 Answer


Short answer: Your code is a bit odd, and so probably triggering garbage collection in unexpected ways, resulting in varied timings.

Long answer: I agree that the timings you are getting are a bit strange. I'm not completely sure I can nail down exactly what is causing the problem, but I'm 99% certain it is something to do with garbage collection.

So, your code is a bit odd, because you allow input arrays of any dimension, even though you then call the dot function (a BLAS routine for taking the dot product of two vectors). In case you didn't realise: if you want a vector, use Array{Float64,1}; for a matrix, Array{Float64,2}; and so on. Or you could use the aliases Vector{Float64} and Matrix{Float64}.
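For instance (a quick sketch; these are just alias checks you can run at the REPL):

# Vector and Matrix are type aliases, not distinct types:
Vector{Float64} === Array{Float64,1}   # true
Matrix{Float64} === Array{Float64,2}   # true

# A signature like f(x::Vector{Float64}) therefore rejects a 1xN matrix:
f(x::Vector{Float64}) = sum(x)
f(rand(3))       # works: rand(3) is an Array{Float64,1}
# f(rand(1,3))   # MethodError: rand(1,3) is an Array{Float64,2}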

The second odd thing I noticed is that in your test, you generate rand(1, N). This returns an Array{Float64,2}, i.e. a matrix. To get an Array{Float64, 1}, i.e. a vector, you would use rand(N). Then within your function you take views into your matrix, which are of size 1xN. Now, Julia uses column-major ordering, so using a 1xN object for a vector is going to be really inefficient, and is probably the source of your strange timings. Under the hood, I suspect the call to dot is going to involve converting these things into regular vectors of floats, since dot eventually feeds through to the underlying BLAS routine which will need this input type. All these conversions will mean plenty of temporary storage, which needs to be garbage collected at some point, and this will probably be the source of the varying timings (90% of the time, varied timings on the same code are the result of the garbage collector being triggered - and sometimes in quite unexpected ways).
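To make the distinction concrete (a sketch; only the types and the memory layout matter here):

A = rand(1, 10000)        # Array{Float64,2}: a 1x10000 matrix
v = rand(10000)           # Array{Float64,1}: a 10000-element vector

# A reversed view, like the ones taken inside cauchyproduct, is not
# contiguous in memory, so BLAS routines cannot use it directly:
view(v, 10000:-1:1)       # reversed range: strided, non-contiguous
view(v, 1:5000)           # forward range: contiguous, BLAS-friendly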

So, there are probably several ways to improve the following, but my quick-and-dirty version of your function looks like this:

function cauchyproduct(L::AbstractVector{<:Number}, R::AbstractVector{<:Number})
    length(L) != length(R) && error("Length mismatch in inputs")
    N = length(L)
    prodterm = zeros(2*N - 1)   # a vector, to match the vector inputs
    R = flipdim(R, 1)           # flip once, so the views below are forward ranges
    for n = 1:N
        prodterm[n] = dot(view(L, 1:n), view(R, N-n+1:N))
    end
    for n = 1:N-1
        prodterm[N+n] = dot(view(L, n+1:N), view(R, 1:N-n))
    end
    return prodterm
end

Note, I flip R before the loop so the memory doesn't need to be re-ordered over and over within the loop. This was no doubt contributing to your strange garbage collection issues. Then, applying your test (I think it is a better idea to move array generation inside the loop in case some clever caching issue throws off timings):

testLength = 10000
for j = 1:20
    goodL = rand(testLength);
    goodR = rand(testLength);
    @time cauchyproduct(goodL,goodR);
end

we get something like this:

  0.105550 seconds (78.19 k allocations: 3.935 MiB, 2.91% gc time)
  0.022421 seconds (40.00 k allocations: 2.060 MiB)
  0.022527 seconds (40.00 k allocations: 2.060 MiB)
  0.022333 seconds (40.00 k allocations: 2.060 MiB)
  0.021568 seconds (40.00 k allocations: 2.060 MiB)
  0.021837 seconds (40.00 k allocations: 2.060 MiB)
  0.022155 seconds (40.00 k allocations: 2.060 MiB)
  0.022071 seconds (40.00 k allocations: 2.060 MiB)
  0.021720 seconds (40.00 k allocations: 2.060 MiB)
  0.024774 seconds (40.00 k allocations: 2.060 MiB, 9.13% gc time)
  0.021714 seconds (40.00 k allocations: 2.060 MiB)
  0.022066 seconds (40.00 k allocations: 2.060 MiB)
  0.021815 seconds (40.00 k allocations: 2.060 MiB)
  0.021819 seconds (40.00 k allocations: 2.060 MiB)
  0.021928 seconds (40.00 k allocations: 2.060 MiB)
  0.021795 seconds (40.00 k allocations: 2.060 MiB)
  0.021837 seconds (40.00 k allocations: 2.060 MiB)
  0.022285 seconds (40.00 k allocations: 2.060 MiB)
  0.021380 seconds (40.00 k allocations: 2.060 MiB)
  0.023828 seconds (40.00 k allocations: 2.060 MiB, 6.91% gc time)

The first iteration is measuring compile time, rather than runtime, and so should be ignored (if you don't know what I mean by this then check the performance tips section of the official docs). As you can see, the remaining iterations are much faster, and quite stable.
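As an aside, for measurements like these the BenchmarkTools package (assuming you have it installed) is usually more trustworthy than bare @time, since it runs many samples, discards compilation, and lets you interpolate global variables out of the measurement:

using BenchmarkTools
goodL = rand(10000)
goodR = rand(10000)
@btime cauchyproduct($goodL, $goodR)   # $ interpolates the globals into the benchmark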

Colin T Bowers
  • Thanks Colin and Evert for your helpful comments. Your solution seems to have fixed this particular issue. The bigger issue is I have a lot to learn and years of bad habits from Matlab as evidenced by this simple example. The reason I didn't specify the array dimension is because this function will actually have to perform higher dimensional cauchy products so the dot() calls will be replaced with .* and sum() calls. I wanted to get a 1d version working first. One last question: I had been using view() since an intro Julia guide said it was the right thing to do. Is that not the case here? – SDK Nov 14 '17 at 14:04
  • @SDK I came to Julia in 2014 from a full-time Matlab background. I would say it took me a year of working in Julia full-time to train myself out of bad Matlab habits. Two quick tips: 1) Look at what I did in the signature with my `Vector{<:Number}` (or you could do `(Vector{T}) where {T<:Number}`). Using the type system like this is the core of multiple dispatch. 2) If you find yourself writing long functions (standard Matlab practice), then you're probably doing julia wrong. Julia lends itself to short functions, often one-liners! – Colin T Bowers Nov 14 '17 at 22:28
  • @SDK Yes, using `view` is a good idea here. `view` creates a view into an array over the specified dimensions, which can then be passed around and have operations performed on it as if it were a smaller array. This avoids the need to make a copy of the original array, and so can massively speed up some routines. You can verify that my function above is about 10 times slower without `view`. Where `view` is not so good is when the subarray you need is not contiguous in memory, eg `view(randn(5), 4:-1:2)`, or `view(randn(5)[5,3])`... TBC – Colin T Bowers Nov 14 '17 at 22:32
  • @SDK In these cases, it will usually be faster to make a new copy of the subarray that will be contiguous in memory and so operations (especially BLAS operations) can take advantage of the contiguity to work much more quickly. More reading [here](https://stackoverflow.com/questions/28271308/avoid-memory-allocation-when-indexing-an-array-in-julia) – Colin T Bowers Nov 14 '17 at 22:33