Because you have not accounted for the effects of vectorized operations and prefetching on small matrices.
Notice that your matrices (10 x 10) are small, so the time needed to allocate temporary storage is not significant (yet). On processors with large caches, these small matrices probably still fit completely into L1 cache, so the speed gained from vectorized operations and prefetching on them more than makes up for both the time lost allocating a temporary matrix and the speed that could be gained by adding directly into already-allocated memory.
But when you increase the sizes of your matrices, the story changes:
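(The definitions of func1 and func2 are not repeated here; my working assumption, based on the question, is that func1 is the plain out-of-place add and func2 writes into an existing array, roughly like this:)

    import numpy as np

    def func1(a1, a2):
        # out-of-place: allocates a fresh result array on every call
        return a1 + a2

    def func2(a1, a2):
        # in-place: writes the sum into a1's existing storage,
        # skipping the temporary allocation
        return np.add(a1, a2, out=a1)
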
In [41]: k = 100
In [42]: a1, a2 = np.random.random((k, k)), np.random.random((k, k))
In [43]: %timeit func2(a1, a2)
4.41 µs ± 3.01 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [44]: %timeit func1(a1, a2)
6.36 µs ± 4.18 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [45]: k = 1000
In [46]: a1, a2 = np.random.random((k, k)), np.random.random((k, k))
In [47]: %timeit func2(a1, a2)
1.13 ms ± 1.49 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [48]: %timeit func1(a1, a2)
1.59 ms ± 2.06 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [49]: k = 5000
In [50]: a1, a2 = np.random.random((k, k)), np.random.random((k, k))
In [51]: %timeit func2(a1, a2)
30.3 ms ± 122 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [52]: %timeit func1(a1, a2)
94.4 ms ± 58.3 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
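At k = 5000 each matrix is roughly 200 MB, so the out-of-place version has to allocate (and later free) a third matrix of that size on every call. You can make the allocation visible with tracemalloc (recent NumPy versions report their buffer allocations to it); a minimal sketch, again assuming the func1/func2 shapes above:

    import numpy as np
    import tracemalloc

    k = 1000
    a1, a2 = np.random.random((k, k)), np.random.random((k, k))

    tracemalloc.start()
    res = a1 + a2                      # out-of-place: builds a new k x k array
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    print(f"out-of-place peak: {peak / 1e6:.1f} MB")  # ~8 MB for k = 1000

    tracemalloc.start()
    np.add(a1, a2, out=a1)             # in-place: no result array is allocated
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    print(f"in-place peak:     {peak / 1e6:.1f} MB")  # ~0 MB
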
Edit: This is for k = 10, to show that what you observed for small matrices also holds on my machine:
In [56]: k = 10
In [57]: a1, a2 = np.random.random((k, k)), np.random.random((k, k))
In [58]: %timeit func2(a1, a2)
1.06 µs ± 10.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [59]: %timeit func1(a1, a2)
500 ns ± 0.149 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
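To put the cache argument in numbers, here is how much memory one float64 matrix occupies at each size used above:

    import numpy as np

    for k in (10, 100, 1000, 5000):
        nbytes = np.zeros((k, k)).nbytes       # k * k * 8 bytes for float64
        print(f"k = {k:4d}: {nbytes / 1024:9.1f} KiB per matrix")

    # k =   10:       0.8 KiB  -> all operands fit in a typical 32 KiB L1 cache
    # k =  100:      78.1 KiB  -> already spills out of L1
    # k = 1000:    7812.5 KiB  -> ~7.6 MiB, L3 territory at best
    # k = 5000:  195312.5 KiB  -> ~190 MiB, main memory; the temporary hurts most here

At k = 10 the arrays are so tiny that both calls are dominated by call overhead rather than by the add itself, so skipping an 800-byte allocation cannot pay for the extra overhead of the in-place call path.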