Is there a way I can improve performance by using special prefetch instructions and if so where/how to deploy them?

Question

I have written a recursive descent parser, so one that is not table-driven, and it typically only takes a very small amount of input text which might only be 30-100 code units (optionally either 16-bit or 32-bits each) with the total depending greatly on the presence of comments within the text that it is parsing. I am wondering if I can improvement performance by using prefetch instructions on x86-64 and AArch64 - prfm instruction iirc ? If so how and where should I deploy them ?

I have started reading the article https://stackoverflow.com/questions/8126311/how-much-of-what-every-programmer-should-know-about-memory-is-still-valid/47714514#47714514 I don’t understand the terms hardware- and software- prefetch. I’m presuming that one of these means using special CPU prefetch instructions. Maybe another one means using normal instruction fetches. Who knows? The article says don’t use one or both types of prefetching as they’re as likely to harm as to help and the strategy is very difficult to develop. The latter I am finding already, and the former means I’m out of luck. — Cecil Ward, Aug 04 '23 at 21:15
Very likely not. Looking at a small amount of data in sequential order, with plenty of computation in between, is best-case for caching and HW prefetchers. If performance counters show any cache misses, you could try a wrapper outside the recursion that prefetches the 64B cache line *after* the one containing the first byte of the string. e.g. `__builtin_prefetch(uft16_str+32, 0 /* read-only */, 3 /* L1 cache */)`. So it can be fetching in parallel with demand loads for the first line. — Peter Cordes, Aug 04 '23 at 21:17
I should try to make sure that the input is aligned on a 64B boundary but it’s unlikely to be something I can control. Peter, is that a GCC builtin? I will be wanting the equivalent from LLVM LDC (the LLVM D compiler). That will save me having to use my own x86 inline asm wrappers for the prefetch instructions and developing an AArch64 equivalent. Does the success of this depend on how much memory read bandwidth is left after code fetching eats it all up? — Cecil Ward, Aug 04 '23 at 21:26
Yes, that's a GNU C extension. Pick whatever equivalent exists in whatever language you're using, if you want to try it. Re: alignment, yes that would be nice, but not worth extra copying so don't expect it. If your string starts at `ptr`, `ptr+64` is in the second cache line, regardless of offset within cache line. (prefetch instructions take a byte address and prefetch the line that contains it. e.g. x86's https://www.felixcloutier.com/x86/prefetchh are all documented with a `m8` operand. Because a single byte is always in exactly one cache line.) — Peter Cordes, Aug 04 '23 at 21:34
I will be fetching either longs (32-bits) or words for UTF-32 or UTF-16 codepoints, so if I’m really unlucky the caller could present me with input data that is off by one and not even on a 2- or 4-byte boundary, as per templated version. The input text is almost certain to be a literal string in a read-only RAM section (addressed with [RIP+offset] on x86-64). I don’t know about alignment in that case. Will find out what the D compilers do. — Cecil Ward, Aug 04 '23 at 21:48
Right, to get `+64` in asm, use `ptr + 64/sizeof(wchar_t)` or whatever. In most systems, `alignof(wchar_t) == sizeof(wchar_t)`, which I assume would be the same in D, so it would be undefined behaviour for a caller to pass you a `wchar_t *` that wasn't aligned to a wchar boundary. Regardless, none of that is an obstacle to prefetching the second cache line given a pointer to the start of the string. — Peter Cordes, Aug 04 '23 at 21:53
Also, you're using compile-time-constant string indices, allowing `[rip + rel32]` addressing modes? Usually a parser needs to use variables to keep track of positions so the same instruction can run on different bytes. — Peter Cordes, Aug 04 '23 at 21:54

Is there a way I can improve performance by using special prefetch instructions and if so where/how to deploy them?

0 Answers0