My project relies heavily on logsumexp in its core algorithm. I'm currently using this library, which is implemented with the SSE instruction set: https://github.com/rmcgibbo/logsumexp
However, modern Intel CPUs support the much more powerful AVX instruction sets, so I would like to know whether there is a faster logsumexp implementation for Python that uses AVX, or even CUDA.
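
For reference, the operation I mean is the usual max-shifted log-sum-exp over a 1-D array. Below is a minimal NumPy sketch that could serve as a correctness baseline when comparing faster versions. The name `logsumexp_ref` is only illustrative and is not taken from the library above, and I'm assuming float64 input here.

```python
import numpy as np

def logsumexp_ref(x):
    """Reference log(sum(exp(x))) for a non-empty 1-D array (illustrative helper)."""
    x = np.asarray(x, dtype=np.float64)
    m = np.max(x)
    if not np.isfinite(m):
        # All -inf gives -inf, any +inf gives +inf, and NaN propagates.
        return m
    # Subtracting the max keeps exp() from overflowing for large inputs.
    return m + np.log(np.sum(np.exp(x - m)))
```

Any SSE, AVX, or CUDA version should agree with this up to floating-point rounding, so it is mainly useful for checking results rather than for speed.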
Thank you.