i have a library and a lot of projects depending on that library. I want to optimize certain procedures inside the library using SIMD extensions. However it is important for me to stay portable, so to the user it should be quite abstract. I say at the beginning that i dont want to use some other great library that does the trick. I actually want to understand if that what i want is possible and to what extent.
My very first idea was to have a "vector" wrapper class, that the usage of SIMD is transparent to the user and a "scalar" vector class could be used in case no SIMD extension is available on the target machine. The naive thought came to my mind to use the preprocessor to select one vector class out of many depending on which target the library is compiled. So one scalar vector class, one with SSE (something like this basically: http://fastcpp.blogspot.de/2011/12/simple-vector3-class-with-sse-support.html) and so on... all with the same interface. This gives me good performance but this would mean that i would have to compile the library for any kind of SIMD ISA that i use. I rather would like to evaluate the processor capabilities dynamically at runtime and select the "best" implementation available.
So my second guess was to have a general "vector" class with abstract methods. The "processor evaluator" function would than return instances of the optimal implementation. Obviously this would lead to ugly code, but the pointer to the vector object could be stored in a smart pointer-like container that just delegates the calls to the vector object. Actually I would prefer this method because of its abstraction but I'm not sure if calling the virtual methods actually will kill the performance that i gain using SIMD extensions.
The last option that i figured out would be to do optimizations whole routines and select at runtime the optimal one. I dont like this idea so much because this forces me to implement whole functions multiple times. I would prefer to do this once, using my idea of the vector class i would like to do something like this for example:
void Memcopy(void *dst, void *src, size_t size)
{
vector v;
for(int i = 0; i < size; i += v.size())
{
v.load(src);
v.store(dst);
dst += v.size();
src += v.size();
}
}
I assume here that "size" is a correct value so that no overlapping happens. This example should just show what i would prefer to have. The size-method of the vector object would for example just return 4 in case SSE is used and 1 in case the scalar version is used. Is there a proper way to implement this using only runtime information without loosing too much performance? Abstraction is to me more important than performance but as this is a performance optimization i wouldn't include it if would not speedup my application.
I also found this on the web: http://compeng.uni-frankfurt.de/?vc Its open source but i dont understand how the correct vector class is chosen.