Efficiently implementing DXT1 texture decompression in hardware

Question

DXT1 compression is designed to be fast to decompress in hardware where its used in texture samplers. The Wikipedia article says that under certain circumstances you can work out the co-efficients of the interpolated colours as:

c2 = (2/3)*c0+(1/3)*c1

or rearranging that:

c2 = (1/3)*(2*c0+c1)

However you re-arrange the above equation, then you end up always having to multiply something by 1/3 (or dividing by 3, same deal even more expensive). And it seems weird to me that a texture format which is designed to be fast to decompress in hardware would require a multiplication or division. The FPGA I'm implementing my GPU on only has limited resources for multiplications and I want to save those for where they're really required.

So am I missing something? Is there an efficient way of avoiding the multiplications of the colour channels by a 1/3? Or should I just eat the cost of that multiplication?

Mike Vine · Answer 1 · 2019-06-06T15:38:23.840

MY best answer I can come up with is that I can use the identity:

x/3 = sum(n=1 to infinity) (x/2^(2n))

and then take the first n terms. Using 4 terms I get:

(x/4)+(x/16)+(x/64)+(x/256)

which equals

x*0.33203125

which is probably good enough.

This relies on multiplication by a fixed power of 2 being free in hardware, then 3 additions of which I can run 2 in parallel.

Any better answer is appreciated though.

** EDIT **: Using a combination of this and @dyslexicgruffalo's answer I made a simple c++ program which iterated over the various sequences and tried them all and recorded the various average/max errors.

I did this for 0 <= x <= 189 (as 189 is the value of 2*c0.g + c1.g when g (which is 6 bits) maxes out.

The shortest good sequence (with a max error of 2, average error of 0.62) and is 4 ops was:

1 + x/4 + x/16 + x/64.

The best sequence which had a max error of 1, average error of 0.32, but is 6 ops was:

x/2 - x/4 + x/8 - x/16 + x/32 - x/64.

For the 5 bit values (red and blue) the maximum value is 31*3 and the above sequences are still good but not the best. These are:

x/4 + x/8 - x/16 + x/32 [max error of 1, average 0.38]

and

1 + x/4 + x/16 [max error of 2, average of 0.68]

(And, luckily, none of the above sequences ever guesses an answer which is too big so no clamping is needed even though they're not perfect)

Apologies @Mike Vine, I think we have written the same answer of sorts! — dyslexicgruffalo, Jun 06 '19 at 10:07

score 1 · Accepted Answer · answered Jun 06 '19 at 10:07

1

This might be a bad way of imagining it, but could you implement it via the use of addition/subtraction of successive halves (shifts)?

As you have 16 bits this gives you the ability to get quite accurate with successive additions and subtractions.

A third could be represented as

a(n+1) = a(n) +/- A>>1, where, the list [0, 0, 1, 0, 1, etc] shows whether to add or subtract the shifted result.

I believe this is called fractional maths.

However, in FPGAs, it is difficult to know whether this is actually more power efficient than the native DSP blocks (e.g. DSP48E1) provided.

answered Jun 06 '19 at 10:07

dyslexicgruffalo

364
1
4
18

I'm beginning to think this is (both our ways) a good way of doing it. I misread the documentation and although you do the compare as a 16bit integer you do the interpolate (with the 1/3 co-efficients) separately for each colour. That means the maximum size of the value is 6 bits [0-63]. So if I'm careful I could probably get either full accuracy or maybe +/-1 of the required result like this. As for not using the DSPs for me its not power, its just the number of them is limited and I want to have as many simultaneous tiles rendering as possible and DSPs is my limiting factor for that. – Mike Vine Jun 06 '19 at 10:23
(this is a fun home project GPU I'm making so I can choose to ignore power - https://mikevine.net/) – Mike Vine Jun 06 '19 at 10:24

Efficiently implementing DXT1 texture decompression in hardware

2 Answers2