I have a problem for which I have eight elements that can contain 0, 1, or 2. I can easily represent this in 16 bits, but for SIMD efficiency reasons, I need it to occupy 13 bits (it is not the only thing present in the lane).

Fortunately, 2^13==8192, and 3^8==6561, so the states I want can fit. However, here's where things get interesting. Naively, I would just represent these states by counting in ternary. For example, to represent the tritmask 0t12211012 (I'll use this as an example throughout), I could just write 0t12211012 = 2*3^0+1*3^1+0*3^2+1*3^3+1*3^4+2*3^5+2*3^6+1*3^7 = 4244 = 0b1000010010100.
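
As a minimal sketch of this encoding (the helper name `encode_trits` and the least-significant-trit-first array order are assumptions made for illustration, not part of the original problem statement), the packing can be written with Horner's rule and checked against the worked example:

```cpp
#include <array>
#include <cstdint>

// Hypothetical helper: packs eight trits into one integer, with trits[0]
// as the least-significant ternary digit.
constexpr uint16_t encode_trits(const std::array<uint8_t, 8>& trits) {
    uint16_t value = 0;
    // Horner's rule, walking from the most-significant trit down.
    for (int i = 7; i >= 0; --i) {
        value = static_cast<uint16_t>(value * 3 + trits[i]);
    }
    return value;
}

// 0t12211012 written least-significant trit first.
constexpr std::array<uint8_t, 8> kExample = {2, 1, 0, 1, 1, 2, 2, 1};
static_assert(encode_trits(kExample) == 4244, "matches the worked example");
static_assert(6560 < (1 << 13), "largest 8-trit value fits in 13 bits");
```

The largest encodable value is 0t22222222 = 6560, which is why 13 bits suffice.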

I have a set of operations I need to support:

  • Modify trits. This is easy in the default representation. For instance, if I have tritmask 0t12211012 and I wish to place a 2 in the position holding a zero, I can simply add 0t200=18. (Note that the conversion to tritspace is easy, because I only have 8 trits, so I can store the base powers in a register and index it with pshufw).
  • Find all elements set to a particular value. For example, given the tritmask 0t12211012, I want to be able to extract the bitmask for 0, which is 0b00000100, for 1, which is 0b10011010, and for 2, which is 0b01100001. This I have not figured out how to do, and is what I would like assistance with. How can I do this in a small number of operations suitable for x86 SIMD?
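
The first operation can be sketched in scalar C++ as follows. The names `kPow3`, `get_trit`, and `set_trit` are hypothetical, and `get_trit` uses a plain division only so the example is self-contained; when the current trit is already known, as in the example above, the update is just the single addition.

```cpp
#include <cstdint>

// Powers of three for the eight trit positions (the table the question
// proposes keeping in a register).
constexpr uint16_t kPow3[8] = {1, 3, 9, 27, 81, 243, 729, 2187};

// Hypothetical helper: read the trit at `pos` (0 = least significant).
constexpr uint8_t get_trit(uint16_t packed, unsigned pos) {
    return static_cast<uint8_t>(packed / kPow3[pos] % 3);
}

// Hypothetical helper: overwrite the trit at `pos` with `value` by adding
// the signed digit difference times 3^pos. No carry or borrow reaches the
// neighbouring trits because the digit stays within 0..2.
constexpr uint16_t set_trit(uint16_t packed, unsigned pos, uint8_t value) {
    return static_cast<uint16_t>(
        packed + (static_cast<int>(value) - get_trit(packed, pos)) * kPow3[pos]);
}
```

For the running example, `set_trit(4244, 2, 2)` adds 0t200 = 18 and yields 4262 = 0t12211212.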

Thank you!

Edit 11/18/20: To give an example of an approach I consider too slow: we can iteratively find the value mod 3 and divide by 3 to pull trits off the least-significant end of the representation, then assemble the mask that way. C++ snippet:

```cpp
#include <cstdint>

uint32_t trits = <something>;  // packed 8-trit value, 0..6560
uint8_t mask0 = 0, mask1 = 0, mask2 = 0;
for (uint8_t shift = 0; shift < 8; ++shift) {
  const uint32_t remainder = trits % 3;  // least-significant trit
  mask0 |= (remainder == 0) << shift;
  mask1 |= (remainder == 1) << shift;
  mask2 |= (remainder == 2) << shift;
  trits /= 3;  // shift right by one trit
}
```

When actually writing this in a SIMD language, we would use the standard multiply-and-shift trick for division by a constant. But you can see it's linear in the number of trits, and has a lot of ops per iteration. We could code-golf this down a bit, but I think it is fundamentally the wrong approach. It should ideally be possible to do something in parallel for each trit... but I don't see it.
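
For reference, a sketch of the multiply-and-shift trick for 16-bit values (the helper names `div3` and `mod3` are hypothetical). The constant 0xAAAB is ceil(2^17 / 3), which gives an exact quotient for every 16-bit input; in SSE2 terms this would correspond to a `pmulhuw` by 0xAAAB followed by a logical right shift by 1.

```cpp
#include <cstdint>

// Divide a 16-bit value by 3 without a division instruction:
// 0xAAAB == ceil(2^17 / 3), so (x * 0xAAAB) >> 17 == x / 3 for all 16-bit x.
constexpr uint16_t div3(uint16_t x) {
    return static_cast<uint16_t>((static_cast<uint32_t>(x) * 0xAAABu) >> 17);
}

// Remainder recovered from the quotient: x - 3 * (x / 3).
constexpr uint8_t mod3(uint16_t x) {
    return static_cast<uint8_t>(x - 3 * div3(x));
}
```

Each trit still costs one multiply, one shift, one multiply-subtract, and the compare, so this lowers the constant factor but not the linear dependence on the number of trits.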

Edit 11/20/20: I've made a halfhearted effort to apply Aha to this problem without success. Maybe an interesting subproblem to solve instead is - is there a short sequence of bitwise ops under the same constraints as above that acts as a 'ternary bitwise AND'? That is, an op that compares two encoded numbers in tritspace and returns a bitmask that is 1 when the corresponding trits are equal and zero otherwise? That would be a primitive from which we could build up the ops needed. We have left and right shift in tritspace (just multiply or divide by 3); and we have +/- a value. So what we are missing is the ability to test if trits are particular values...
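
To pin down what that primitive would have to compute, here is a reference definition only; it still uses the slow digit-by-digit loop and is not the short bitwise sequence being asked for. The names `trit_equal_mask`, `kAllOnes`, and `mask_of` are hypothetical. It relies on 0t11111111 = (3^8 - 1) / 2 = 3280, so multiplying that constant by v broadcasts the trit v to all eight positions.

```cpp
#include <cstdint>

// Reference semantics only: bit i of the result is 1 when trit i of `a`
// equals trit i of `b`, and 0 otherwise.
constexpr uint8_t trit_equal_mask(uint16_t a, uint16_t b) {
    uint8_t mask = 0;
    for (unsigned i = 0; i < 8; ++i) {
        mask |= static_cast<uint8_t>((a % 3 == b % 3) << i);
        a /= 3;
        b /= 3;
    }
    return mask;
}

// 0t11111111 == (3^8 - 1) / 2 == 3280; multiplying by v broadcasts trit v.
constexpr uint16_t kAllOnes = 3280;

// Mask of the positions holding `v` (0, 1, or 2), built from the primitive.
constexpr uint8_t mask_of(uint16_t packed, unsigned v) {
    return trit_equal_mask(packed, static_cast<uint16_t>(v * kAllOnes));
}
```

With the running example, `mask_of(4244, 0)`, `mask_of(4244, 1)`, and `mask_of(4244, 2)` give 0b00000100, 0b10011010, and 0b01100001, matching the masks listed earlier.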

Octavianus
  • 1
    If you have your number in binary, but need the base 3 digits, AFAIK there's no more efficient way than repeated division and mod (using a multiplicative inverse, of course). Same as when you need the decimal digits of a binary integer. You're probably better off using BCD, err BCT I guess you'd call it, with 2-bit fields for each trit, even if it means lower data density (like 2 vectors to hold all your data, or keeping the 3 bits of metadata(?) in a separate uint8_t array, like array of structs style). You can `pmovzxbw` to load from that 8-bit array and line up metadata with trit elements. – Peter Cordes Nov 19 '20 at 03:06
  • To go from 8 _trits_ to 8 quaternary digits, perhaps a `uint16_t three2four[6561]` pre-computed array? From 4 quaternary digits to 4 trits, maybe 2 stages of `uint8_t four2three[256]`? – chux - Reinstate Monica Nov 19 '20 at 03:18
  • 1
    Sounds like you have to make a choice: the convenient representation on 16 bits, or the compressed representation on 13 bits. Now you are asking for the 13 bit representation but without the inconvenience! – Stef Nov 19 '20 at 04:03
  • 2
    you can pack 5 trits into 8 bits (3^5 = 243 < 256). But you still need to use a division to extract the trits. You can use a smaller lookup table than `uint16_t three2four[6561]` though – phuclv Nov 19 '20 at 04:42
  • 3
    [This](https://hal.archives-ouvertes.fr/hal-02103214/) paper may be interesting, though they focus on trits with value set {-1, 0, 1} rather than {0, 1, 2}. – Mark Dickinson Nov 19 '20 at 09:10
  • 1
    If you want to SIMDify your algorithm, you could broadcast the input to 8 lanes, and divide element `k` by `3^k` and calculate the remainder mod 3 of that (all using inverse-multiplication, of course) -- there are probably more elegant solutions, though. – chtz Nov 19 '20 at 15:19
  • PeterCordes: Unfortunately, this is a field in a 32-bit lane, where these three bits are pushing it to 35 bits. I agree that this can be stored separately. chux-ReinstateMonica: This works, but is not vectorizable. Stef: That is in fact what I am hoping to avoid. phuclv: Acknowledged. MarkDickinson: Indeed this paper is very interesting. They are focused on constructing an *arbitrary* mapping between bitspace and tritspace, so it loses all the properties that make it usable in tritspace, though :( chtz: Yes; but this is equivalent to doing a slow op in parallel in SIMD. – Octavianus Nov 21 '20 at 06:37
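
For concreteness, the lane-wise idea from chtz's comment can be emulated in scalar C++ as below. This is a sketch under assumptions: the names `kLaneDivisor`, `TritMasks`, and `lane_masks` are hypothetical, and plain `/` and `%` stand in for the reciprocal multiplications a real SIMD kernel would use. Every loop iteration is independent of the others, so each one could run in its own lane after broadcasting the packed value.

```cpp
#include <cstdint>

// Per-lane divisors: lane k isolates trit k via (x / 3^k) % 3.
constexpr uint16_t kLaneDivisor[8] = {1, 3, 9, 27, 81, 243, 729, 2187};

struct TritMasks {
    uint8_t zeros;  // bit k set when trit k == 0
    uint8_t ones;   // bit k set when trit k == 1
    uint8_t twos;   // bit k set when trit k == 2
};

// Scalar emulation of eight lanes. Collecting one bit per lane plays the
// role a movemask-style instruction would have in a vector version.
inline TritMasks lane_masks(uint16_t packed) {
    TritMasks out = {0, 0, 0};
    for (unsigned lane = 0; lane < 8; ++lane) {
        const unsigned trit = packed / kLaneDivisor[lane] % 3;
        out.zeros |= static_cast<uint8_t>((trit == 0) << lane);
        out.ones |= static_cast<uint8_t>((trit == 1) << lane);
        out.twos |= static_cast<uint8_t>((trit == 2) << lane);
    }
    return out;
}
```

This removes the serial dependency between trits, but it spends eight lanes on one packed value, which is the cost the reply above objects to.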

0 Answers