Ty AJD and pgSystemTester. I confirm AJD results for Integer 33 and Long 20. Yet Long is much much faster only because your test program is small enough to fit entirely in the processor cache memory. The fastest ram is L1 cache and that is only 32kB for data and 32kB for program. Testing Byte, Integer and Long for one or several arrays that barely fit into 32kB of L1 data cache will give you totally different or opposite results regarding the speeds.
In my case for the same arrays that totals 120kB for Integers and 240kB for Long i had the same result for Long as for Integer.
That is because changing to Long the same arrays totals double the size compared with Integer arrays, and so, more and more of data had fall outside L1 cache due to the change to Long. To reach data outside L1 cache took much more clocks or time.
Therefore your test is good only as a test yet in real life is misleading as msdn.microsoft recommendation to use Long regardless. Also those who emphasize that ram size is double for Long had not emphasize the consequence for the processor waiting time to reach data outside of L1 cache or even worst outside L2 cache or outside L3 cache. For each outside L1, L2 and L3 the time to reach the data will increase dramatically and this is most important for speed.
To summarize:
if your data fit inside the L1 cache then Long is fastest but
that is only 4k of data times 4Bytes for Long = 16kB (because other programs and OS will populate the rest of 32k of L1 cache),
Byte and Integer will drastically increase the speed for arrays in size
of at least 16kB, because changing to Long will increase size and this
will force more data to reside outside the fastest L1 cache ram.
Try the same test but instead of Dim a As Byte use Dim a() As Byte , example:
Dim a() As Byte, b() As Byte, c() As Byte
ReDim a(7, 24, 60), b(24, 7, 60), c(24, 60, 7)
Dim h As Long, loops As Long: Dim i As Long, j As Long, k As Long ' these i, j, k always As Long
loops=1
For h = 1 To loops
For i = 1 To 6: For j = 0 To 23: For k = 1 To 58
a(i, j, k) = a(i + 1, j, k): b(j, i, k) = b(j, i - 1, k)
c(j, k, i) = a(i - 1, j, k + 1) + b(j, i - 1, k - 1)
Next k: Next j: Next i
For i = 6 To 1 Step -1: For j = 23 To 0 Step -1: For k = 58 To 1 Step -1
a(i, j, k) = a(i + 1, j, k): b(j, i, k) = b(j, i - 1, k)
c(j, k, i) = a(i - 1, j, k + 1) + b(j, i - 1, k - 1)
Next k: Next j: Next i
Next h
First set "loops" to 1 to see how long it takes. Then increase it gradually aiming for several seconds for As Bytes. It will take longer for As Integer, and even longer for As Long...
The size of each of the 3 array is 8x25x61 = 12200 and this is
12200 kB times 3 = 36600 kB for As Byte ,
24400 kB times 3 = 73200 kB for As Integer ,
48800 kB times 3 = 146400 kB for As Long .
Run the same code with Dim a() As Integer, b() As Integer, c() As Integer, then the same with Dim a() As Long, b() As Long, c() As Long and so on.
Now if you increase one dimension 20 times you will expect a 20 increase in duration but it will be much more because now data will fall outside the L2 cache (1MB shared for all 4 cores).
If you increase one dimension 200 times you will expect a 200 increase in duration but it will be again much more because now data will fall outside the L3 cache (6-8MB shared for all 4 cores and the same 8MB for 8 cores or 16MB if ryzen 5800...).
I can't understand why after 20 years or more the L1 cache is only 64kB when it can be at least 16x16=256kB . With 16bits for row address and 16bits for column address you have to read only 32bits and that is one read for 32 bits processor. I suspect that is because perhaps the core still works on 16bits (8 for row + 8 for column address, 8x8=64kB) or worst on only 8bits.
After testing please post your results.