In addition to the other answers, and to emphasize one important point again: the question is very broad, and the performance will depend on many factors that you did not mention in the question. You may already be aware of these factors, but if in doubt, a summary can be found in this answer (the question might seem unrelated at first glance and refers to CUDA, but many of the concepts apply to OpenCL as well).
One of the main driving ideas behind OpenCL is heterogeneous computing (remarkably, the page does not even mention OpenCL...). That is, OpenCL aims at offering the developer the possibility to exploit all available processing resources, ranging from a single ARM core to multiple high-end GPUs with thousands of cores.
This versatility comes at a cost. Certain concepts are implicitly tailored for many-core architectures (or at least, this seems to be the main application area so far). And in any case, "optimizing" an OpenCL program often just means "tweaking it in order to run particularly fast on one certain architecture". Things like vectorization or shared memory may be advantageous on one platform, but not available at all on another.
There are some ways to circumvent this, or at least to make an OpenCL program more "agnostic" of the hardware that it will run on. One obvious option is to query the target platform properties (for example, the preferred vector size or whether shared/local memory is available) and launch different kernels depending on the result. Thanks to the built-in compiler in OpenCL, it is even possible to inject platform-specific optimizations, for example via #defines, into the kernel source code, as in the sketch below. However, it is hard to make general statements about the effort-to-performance-gain ratio of this kind of optimization. And it is even harder to predict whether the possibly reduced performance of a "generic" OpenCL implementation (compared to a perfectly tweaked C implementation) will not sooner or later be compensated, as the number of cores increases and the OpenCL compilers become better.
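For illustration, a minimal sketch of such a query-and-#define approach could look like the following (plain C host code, error checking omitted; the function name `buildForDevice` and the macro names `VECTOR_WIDTH` and `HAVE_DEDICATED_LOCAL_MEM` are just placeholders, only the `clGetDeviceInfo`/`clBuildProgram` calls are standard API):

```c
#include <stdio.h>
#include <CL/cl.h>

/* Sketch: query device properties and forward them to the kernel
 * compiler as #defines. Assumes 'program' and 'device' were created
 * elsewhere; names and macros are hypothetical. */
void buildForDevice(cl_program program, cl_device_id device)
{
    cl_uint vectorWidth = 1;
    cl_device_local_mem_type localMemType = CL_GLOBAL;

    /* Preferred native vector width for float (often 4 or 8 on CPUs, 1 on many GPUs) */
    clGetDeviceInfo(device, CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT,
                    sizeof(vectorWidth), &vectorWidth, NULL);

    /* CL_LOCAL means dedicated on-chip local ("shared") memory,
     * CL_GLOBAL means it is merely emulated in global memory */
    clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_TYPE,
                    sizeof(localMemType), &localMemType, NULL);

    /* Pass the results to the kernel source as #defines */
    char options[256];
    sprintf(options, "-DVECTOR_WIDTH=%u -DHAVE_DEDICATED_LOCAL_MEM=%d",
            vectorWidth, localMemType == CL_LOCAL ? 1 : 0);

    clBuildProgram(program, 1, &device, options, NULL, NULL);
}
```

Inside the kernel source, these defines can then be used with ordinary `#if`/`#else` blocks to select, say, a tiled version using local memory or a plain global-memory version.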
So my recommendation would be to benchmark some "representative" tasks and see whether the performance is competitive across the different devices, keeping in mind that the average number of cores per device (and, most likely, the general heterogeneity of the devices) will increase, and that OpenCL might make it easier to adapt to these changes. A simple way to obtain kernel timings for such benchmarks is sketched below.
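For example, kernel execution times can be measured with OpenCL's built-in event profiling; a minimal sketch (again with a hypothetical function name and without error checking) could look like this:

```c
#include <CL/cl.h>

/* Sketch: time a single kernel launch via OpenCL event profiling.
 * Assumes 'context', 'device', 'kernel' and 'globalSize' already exist. */
double timeKernelMs(cl_context context, cl_device_id device,
                    cl_kernel kernel, size_t globalSize)
{
    /* Profiling must be enabled when the command queue is created */
    cl_command_queue queue = clCreateCommandQueue(
        context, device, CL_QUEUE_PROFILING_ENABLE, NULL);

    cl_event event;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                           &globalSize, NULL, 0, NULL, &event);
    clWaitForEvents(1, &event);

    /* Timestamps are reported in nanoseconds */
    cl_ulong start = 0, end = 0;
    clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START,
                            sizeof(start), &start, NULL);
    clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END,
                            sizeof(end), &end, NULL);

    clReleaseEvent(event);
    clReleaseCommandQueue(queue);
    return (end - start) * 1e-6; /* ns -> ms */
}
```

Running such a timing function for the same kernel on each available device gives a first impression of how "portable" the performance of a particular implementation really is.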