I'm not an expert in jpeg and compression topics, but as your problem is pretty much as I/O limited as it gets (assuming that you can rotate without heavy de/encoding-related computation), you you might not be able to accelerate it very much on the GPU you have. (Un)Luckily your reference is a pretty slow Atom CPU.
I assume that the Radeon has separate main memory. This means that data needs to be communicated through PCI-E which is the extra latency compared to CPU execution and without hiding you can be sure that it is the bottleneck. This is the most probable reason why your code that uses OpenCV on the GPU is slow (besides the fact that you do two memory-bound operations, transpose & flip, instead of a single one).
The key thing is to hide as much of the PCI-E transfer times with computation as possible by using multiple-buffering. Overlapping transfers both to and from the GPU with computation by making use of the full-duplex capability of PCI-E will only work if the card in question has dual-DMA engines like high-end Radeons or the NVIDIA Quadro/Tesla cards -- which I highly doubt.
If your GPU compute-time (the time it takes the GPU to do the rotation) is lower than the time the transfer takes, you won't be able to fully overlap. The HD 4530 has a pretty slow memory interface with only 12.8 Gb/s peak, and the rotation kernel should be quite memory bound. However, I can only guesstimate, but I would say that if you reach peak PCI-E transfer rate of ~1.5 Gb/s (4x PCI-E AFAIK), the compute kernel will be a few times faster than the transfer and you'll be able to overlap very little.
You can simply time the parts separately without requiring elaborate asynchronous code and you can estimate how fast can you get things with an optimum overlap.
One thing you might want to consider is getting hardware which doesn't exhibit PCI-E as a bottleneck, e.g:
- AMD APU-based system. On these platforms you will be able to page-lock the memory and use it directly from the GPU;
- integrated GPUs which share main memory with the host;
- a fast low-power CPU like a mobile Intel Ivy Bridge e.g. i5-3427U which consumes almost as little as the Atom D525 but has AVX support and should be several times faster.