
Our web server needs to compose many large images together before sending the results to web clients. This process is performance critical because the server can receive several thousand requests per hour.

Right now our solution loads PNG files (around 1 MB each) from the hard disk and sends them to the video card so the composition is done on the GPU. We first tried loading our images using the PNG decoder exposed by the XNA API, but the performance was poor.

To determine whether the problem was reading from the hard disk or decoding the PNG, we changed the test to load the file into a memory stream and then feed that memory stream to the .NET PNG decoder. The performance difference between XNA and the System.Windows.Media.Imaging.PngBitmapDecoder class is not significant; we get roughly the same results with both.
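
For reference, here is a minimal sketch of the WPF decoding path described above; the helper method is just an illustrative placeholder, not our actual code. Loading the whole file into a MemoryStream first separates the disk I/O time from the pure decode time:

```csharp
using System.IO;
using System.Windows.Media.Imaging;

static BitmapSource DecodePng(string path)
{
    byte[] fileBytes = File.ReadAllBytes(path);   // measured as "load images from disk"

    using (var ms = new MemoryStream(fileBytes))
    {
        // Measured as "decode PNGs": BitmapCacheOption.OnLoad forces the full
        // decode here rather than lazily when the bitmap is first used.
        var decoder = new PngBitmapDecoder(
            ms,
            BitmapCreateOptions.PreservePixelFormat,
            BitmapCacheOption.OnLoad);

        return decoder.Frames[0];
    }
}
```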

Our benchmarks show the following performance results:

| Step | Time | Share |
| --- | ---: | ---: |
| Load images from disk | 37.76 ms | 1% |
| Decode PNGs | 2816.97 ms | 77% |
| Load images on video hardware | 196.67 ms | 5% |
| Composition | 87.80 ms | 2% |
| Get composition result from video hardware | 166.21 ms | 5% |
| Encode to PNG | 318.13 ms | 9% |
| Store to disk | 3.96 ms | 0% |
| Clean up | 53.00 ms | 1% |
| **Total** | **3680.50 ms** | **100%** |

From these results we see that the slowest part by far is decoding the PNGs.

So we are wondering whether there is a PNG decoder we could use that would reduce the PNG decoding time. We also considered keeping the images uncompressed on the hard disk, but each image would then be 10 MB instead of 1 MB, and since there are several tens of thousands of these images stored on the hard disk, it is not possible to store them all uncompressed.

EDIT: More useful information:

  • The benchmark simulates loading 20 PNG images and compositing them together. This will roughly correspond to the kind of requests we will get in the production environment.
  • Each image used in the composition is 1600x1600 in size.
  • The solution will involve as many as 10 load-balanced servers like the one we are discussing here, so extra software development effort could be worth it for the savings in hardware costs.
  • Caching the decoded source images is something we are considering, but each composition will most likely be done with completely different source images, so cache misses will be high and performance gain, low.
  • The benchmarks were done with a crappy video card, so we can expect the PNG decoding to be even more of a performance bottleneck using a decent video card.
sboisse
  • What's [`Image.FromFile()`](http://msdn.microsoft.com/en-us/library/stf701f5.aspx)'s performance? Or: what do you mean by "decoding"? – CodeCaster Jul 03 '12 at 14:38
  • Using uncompressed textures is not a good idea in a game (I am guessing this is a game because of XNA?). Even with faster decoders you're still going to be slow, especially at 10 MB. Another problem is that if you're sending this uncompressed to the GPU, the GPU then has latency dealing with it. You want textures to load fast in games, or else you will get low frame rates because the engine is busy opening textures, or you might get texture popping. – Serguei Fedorov Jul 03 '12 at 14:39
  • +1 for actually profiling – Joey Jul 03 '12 at 14:39
  • Have you tried different PNG encodings to see the impact on performance (interlaced, 24-bit, less efficient compression)? – joocer Jul 03 '12 at 14:47
  • @sboisse Another alternative is to cache the uncompressed images. I'd gather some stats on which images are used and when, and check what the cache hit ratio would be. If you allocate something like 10 GB for the cache on disk, that's 1,000 images. – M Afifi Jul 03 '12 at 14:52
  • The decoding part is the time it takes for the file data to be converted to a bitmap usable for presentation on screen, or by the GPU. Texture2D.FromFile() and using the WPF classes roughly gave the same level of performance. – sboisse Jul 03 '12 at 14:56
  • @user1260028: this is not a game. It is a web server that must do many image compositions to be sent to web browsers. We use XNA and the GPU because image composition is faster than doing it on the CPU. We benchmarked CPU vs GPU performance and we found that the overhead of sending the images to and from the GPU was worth it, since composition was done much quicker this way. – sboisse Jul 03 '12 at 14:58
  • @M Afifi: we considered the option of caching the images, and we might get slightly better performance doing so in the production environment. However, the image compositions will almost always be done using completely different images, so cache miss rates would be really high. – sboisse Jul 03 '12 at 15:02
  • @joocer: haven't tried yet. We will investigate and come back with results. – sboisse Jul 03 '12 at 15:04
  • I'd store the images as binary, pre-decoded files you can load and feed to the GPU immediately. If they take up 10 MB each, you can store almost 100 thousand of them per TB (I fixed my math). – Alex Jul 03 '12 at 15:04
  • @sboisse I didn't say cache the compositions, but the source images after decoding (the slowest step). Exactly as Alex suggested. – M Afifi Jul 03 '12 at 15:23
  • Given that this is on a server, what kind of video card do you have? "The difference of performance using XNA or using System.Windows.Media.Imaging.PngBitmapDecoder class is not significant" smells like the video card is a typical "not worth mentioning" item, which means the GPU side is pretty much irrelevant. – TomTom Jul 03 '12 at 15:37
  • @M Afifi: Yes, I understood you perfectly. But there will be tens if not hundreds of thousands of source images available, and from one composition to the other these images will not be the same, so cache misses will be very high. – sboisse Jul 03 '12 at 15:51
  • @TomTom: Actually our tests are done with a pretty crappy video card, so we should expect the transfers to/from the GPU and the GPU rendering time to decrease with a decent video card. With such a setup, the CPU will be even more of a bottleneck. – sboisse Jul 03 '12 at 15:55
  • It depends not on how crappy, but on how recent - ever since people started using video cards for calculations, the data path FROM THE CARD TO MEMORY has been sped up significantly. I would say you likely DO NOT USE THE GRAPHICS CARD AT ALL - could it be that your code falls back to software rendering? I would check that. – TomTom Jul 03 '12 at 15:56
  • What did you implement after all? – BitBank Dec 01 '14 at 10:31
  • Our customer did not have the budget to implement a more optimized PNG decoder so we had to stick with the decoder provided out of the box in .NET. – sboisse Dec 02 '14 at 14:36

4 Answers


There is another option: write your own GPU-based PNG decoder. You could use OpenCL to perform this operation fairly efficiently (and perform your composition using OpenGL, which can share resources with OpenCL). It is also possible to interleave transfer and decoding for maximum throughput. If this is a route you can/want to pursue, I can provide more information.

Here are some resources related to GPU-based DEFLATE (and INFLATE).

  1. Accelerating Lossless compression with GPUs
  2. gpu-block-compression using CUDA on Google code.
  3. Floating point data-compression at 75 Gb/s on a GPU - note that this doesn't use INFLATE/DEFLATE but a novel parallel compression/decompression scheme that is more GPU-friendly.

Hope this helps!

Ani
  • Sure this is feasible? I tried searching for that when I was reading the question, but Google does not turn up any reference I could find. – TomTom Jul 03 '12 at 15:34
  • Also note - this is for a server. Servers with a decent GPU are kind of rare, and server-level cards are VERY expensive. Could be a dead end for web server image processing. – TomTom Jul 03 '12 at 15:36
  • @TomTom: Who says server-level cards are expensive? You don't need a "workstation card." A plain consumer card works fine. The last place I interned at used a cluster of consumer GPUs in their production server. – Mike Bailey Jul 03 '12 at 15:36
  • Ah, yes. Try fitting that one in a typical hosting-level server. If you go and buy/rent a server, you can NOT put in a typical card, except low-power ones - you must choose your hardware properly. It starts with 1U being really small, and 2U servers not having the power supplies for higher-end graphics cards. Yes, you build servers for that if you need them - I have a couple of quad 6990s in another room doing volatility calculations. But look at server vendors and try fitting a GPU into a rack server and you get frustrated fast. – TomTom Jul 03 '12 at 15:42
  • I wasn't sure if this was feasible for the OP, hence the "can/want to pursue" bit. So @sboisse, is this an option for you? – Ani Jul 03 '12 at 15:46
  • We have fairly good control over the hardware to be used for the servers; it just has to be reasonably cost effective. So we could probably consider a hardware solution that lets us decode the images on the GPU, provided we could get interesting performance results following that path. – sboisse Jul 03 '12 at 16:02
  • @TomTom, there is such a thing as a [1U GPU server](http://www.siliconmechanics.com/i27076/GPU-1U-Server.php). You can even [cram three of them in there](http://www.siliconmechanics.com/i38371/GPU-1U-Server.php). You're right about them being expensive and power-hungry, but they do exist. – Charles Jul 03 '12 at 17:19
  • Yes, there is. The question is whether the poster has one, or the budget. I don't say they do not exist. I say that if you have a "server" (not explicitly planned for that) you CAN NOT PUT IN A GPU. That simple. There simply is no space. So, unless someone PLANNED to use a GPU, he simply has no possibility and likely no budget. – TomTom Jul 03 '12 at 17:25
  • @TomTom we understand your point, but the OP states this is possible if indeed beneficial. – Ani Jul 03 '12 at 17:34

Have you tried the following two things?

1)
Multi-thread it. There are several ways of doing this, but one would be an "all in" method: basically spawn X threads, each handling the full process.

2)
Alternatively, consider having X threads do all the CPU work and then feed the results to the GPU thread.

Your question is very well formulated for a new user, but some more information about the scenario might be useful. Are we talking about a batch job or serving pictures in real time? Do the 10k pictures change?

Hardware resources
You should also take into account what hardware resources you have at your disposal. Normally the two cheapest things are CPU power and disk space, so if you only have 10k pictures that rarely change, converting them all into a format that is quicker to handle might be the way to go.

Multithreading trivia
Another thing to consider when multithreading is that it's normally smart to run the threads at BelowNormal priority, so you don't make the entire system "lag". You have to experiment a bit with the number of threads to use; if you're lucky you can get close to a 100% speed gain per core, but this depends a lot on the hardware and the code you're running.

I normally use Environment.ProcessorCount to get the current CPU count and work from there :)
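
For illustration only, a minimal sketch of the "one worker per core" idea with BelowNormal priority threads; DecodePng and the input list are hypothetical placeholders for whatever decoder and images are actually used:

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

static void DecodeAllInParallel(string[] pngPaths)
{
    var queue = new ConcurrentQueue<string>(pngPaths);        // work items
    var workers = new Thread[Environment.ProcessorCount];     // one thread per core

    for (int i = 0; i < workers.Length; i++)
    {
        workers[i] = new Thread(() =>
        {
            string path;
            while (queue.TryDequeue(out path))
            {
                var bitmap = DecodePng(path);   // CPU-bound decode (placeholder)
                // hand the decoded bitmap over to the GPU/composition thread here
            }
        });
        workers[i].Priority = ThreadPriority.BelowNormal;  // keep the rest of the system responsive
        workers[i].IsBackground = true;
        workers[i].Start();
    }

    foreach (var w in workers) w.Join();
}
```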

EKS
  • I don't understand the game engine comments. Given this "Our web server needs to process many compositions of large images together before sending the results to web clients" – M Afifi Jul 03 '12 at 14:54
  • Comments suggest a game, but remember: are we talking about serving pictures in real time, or some sort of batch job? :) If it's a batch job, caching might not be useful. But in real time, caching is a huge plus. – EKS Jul 03 '12 at 15:06
  • @EKS The comments explicitly state that this is not for a game: `@user1260028: this is not a game. It is a webserver that must do many image compositions to be sent to web browsers.` I do realize that you posted your answer before that comment was written. – Esoteric Screen Name Jul 03 '12 at 15:08
  • Very interesting proposition. Using your suggestion we could decode 4 images simultaneously using a quad-core processor, so theoretically we would divide decoding time by 4. But still, PNG decoding would probably remain a bottleneck, and if we could make it faster, that would be even better. – sboisse Jul 03 '12 at 15:12
  • Added some more to my reply. You should begin by setting a goal time to convert the pictures, and then work towards that. Perhaps multithreading can bring you half the way? – EKS Jul 03 '12 at 15:20
  • Multiple users are already concurrent; this will only help in composing the 10 images a user selected more quickly. The overall concern of the web server not being able to handle thousands of requests per hour is still legitimate. – M Afifi Jul 03 '12 at 15:25
  • It's a good idea, but this is really one of those cases where multithreading is just a crutch and the actual issue is the underlying algorithm. It's possible that the decoder being used is suboptimal, in which case you shouldn't expect multithreading alone to make things better. – Mike Bailey Jul 03 '12 at 15:33
  • @Mike Bantegui, I agree both angles should be attacked. I may be simple, but I would say if it can be multithreaded, it should be :) There is also the complexity involved; the algorithm might be very hard. – EKS Jul 03 '12 at 15:35
  • Ah, no. Sometimes brute force is needed. A decent AMD server can give you 32 cores to work with for less money than thinking about the problem, and a queueing mechanism could run any number of composers in a grid-type environment. There are some problems that can only be solved by brute force. – TomTom Jul 03 '12 at 15:35

I've written a pure C# PNG coder/decoder (PngCs); you might want to give it a look. But I highly doubt it will have better speed performance [*]; it's not highly optimized, rather it tries to minimize memory usage when dealing with huge images (it encodes/decodes sequentially, line by line). But perhaps it serves you as boilerplate to plug in some better compression/decompression implementation. As I see it, the speed bottleneck is zlib (inflater/deflater), which (contrary to Java) is not implemented natively in C# - I used the SharpZipLib library, with pure C# managed code; this cannot be very efficient.

I'm a little surprised, however, that in your tests decoding was so much slower than encoding. That seems strange to me because, in most compression algorithms (perhaps in all, and surely in zlib), encoding is much more compute intensive than decoding. Are you sure about that? (For example, this speed test, which reads and writes 5000x5000 RGB8 images (not very compressible, about 20 MB on disk), gives me about 4.5 seconds for writing and 1.5 seconds for reading.) Perhaps there are other factors apart from pure PNG decoding?

[*] Update: newer versions (since 1.1.14) have several optimizations; especially if you can use .NET 4.5, it should provide better decoding speed.

leonbloy
  • I think the decoding takes longer than the encoding because the code is decoding, say, 100 images but only encoding 1. – joocer Jul 04 '12 at 09:23
  • Indeed, joocer is right: our test loads 20 source images but generates only one after composition is complete, hence the decoding time being higher than the encoding time. – sboisse Jul 04 '12 at 17:47
  • Ah, that makes sense, sorry about that – leonbloy Jul 04 '12 at 17:49
  • User has relocated their project PngCs to https://github.com/leonbloy/pngcs – Prime Dec 11 '22 at 13:11

You have multiple options:

  • Improve the performance of the decoding process

    You could use another, faster PNG decoder (libpng is a standard library which might be faster), or you could switch to another picture format that uses simpler, faster-to-decode compression.

  • Parallelize

    Use the .NET parallel processing capabilities to decode concurrently. Decoding is likely single-threaded, so this could help if you run on multi-core machines (see the sketch after this list).

  • Store the files uncompressed but on a device that compresses

    For instance a compressed folder or even a SandForce SSD. This will still compress, but differently, and burdens other software with the decompression. I am not sure this will really help and would only try it as a last resort.
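
A minimal sketch of the "parallelize" bullet above, using the Task Parallel Library from .NET 4 together with the WPF decoder the question already uses (the method name is just an illustrative placeholder):

```csharp
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;
using System.Windows.Media.Imaging;

static BitmapSource[] DecodeConcurrently(string[] pngPaths)
{
    var results = new ConcurrentBag<BitmapSource>();

    // One file per worker; the TPL sizes the thread pool to the machine.
    Parallel.ForEach(pngPaths, path =>
    {
        using (var stream = File.OpenRead(path))
        {
            var decoder = new PngBitmapDecoder(
                stream,
                BitmapCreateOptions.PreservePixelFormat,
                BitmapCacheOption.OnLoad);

            BitmapFrame frame = decoder.Frames[0];
            frame.Freeze();               // allow the bitmap to be used from other threads
            results.Add(frame);
        }
    });

    return results.ToArray();
}
```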

IvoTops
  • We could expect parallelization to improve decoding time, but since at peak times many requests are made per second, it would come down to the same number of requests being handled per second by the server. Multithreading will be done at the request level instead of at the decoding level. – sboisse Jul 03 '12 at 16:10
  • We will look at libpng and see how it goes. Using some other lossless image format could be an interesting alternative. Any format to suggest? – sboisse Jul 03 '12 at 16:12
  • Not really. I cannot find much info on relative decompression times between picture formats. For JPG here is an interesting comparison which shows a factor 10 difference between the slowest and the fastest decoder... http://www.briancbecker.com/blog/2010/analysis-of-jpeg-decoding-speeds/ – IvoTops Jul 04 '12 at 11:57
  • A BMP file will load much faster than a PNG file, even though a BMP file is massively bigger. – Tara Jun 05 '16 at 22:56