First - profile the code. You're left with little more than speculation if you cannot definitively identify the bottlenecks.
Next, scour the documentation for libjpeg speedup opportunities. You mentioned scale_num
and scale_denom
. What about the decompressor's dct_method
? I've found that the DCT_FASTEST
option is good. There are other options to check: do_fancy_upsampling
, do_block_smoothing
, dither_mode
, two_pass_quantize
, etc. Some of all of these may be useful to you, depending on your system, libjpeg version, etc.
If profiling tools are unavailable, there are still some things to try. First, I suspect your bottlenecks are non-CPU related. To confirm, load the uncompressed image into a RAM buffer, then decompress it from there as you have been. Did that significantly improve the decompression time? If so, the culprit would appear to be the read operation from your image storage media. Depending on your system, reading from USB (or SD, etc) can be slow. (Note that I'm assuming a read from external media - although hardware details are scant.) Be sure to optimize relevant bus parameters, as well (SPI clocks, configurations, etc).
If you are reading from something like internal flash (i.e. NAND), there are some other things to inspect. How is your NAND controller configured? Have you ensured that the controller is configured for the fastest operation? Check wait states, timings, etc. Note that bus and/or memory contention can be an issue, too - so inspect their respective configurations, as well.
Finally, if you believe your system is actually CPU bound, this stackoverflow question may be of interest:
Can a high-performance jpeglib-turbo implmentation decompress/compress in <100ms?