-1

How can i encode in bytes a Language with instructions like "PUSH_k x" for an arbitrarily large x and k the number of bytes of x?

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
  • 1
    It's unclear what you want but maybe you can put the length itself into the encoding, similar to length prefixed strings. – Jester Nov 07 '19 at 17:10
  • 3
    There are many ways to do this. – fuz Nov 07 '19 at 17:10
  • i just need one way, i've been looking for ages for it but nothing comes even close to my problem – Bernardo Varanda Nov 07 '19 at 17:16
  • if you have a byte sequence with your instructions say {DUP, POP, PUSH1 argument , PUSH2 argument, ...} and the arguments are also sequences of bytes how to encode a program? – Bernardo Varanda Nov 07 '19 at 17:17
  • As I said, put the length of the argument into the encoding. E.g. if your push has an opcode of `0x42` then `PUSH_2 0x1234` could look `0x42 0x02 0x12 0x34`. – Jester Nov 07 '19 at 17:47
  • How do you know what instruction is the first then (the encoding is arbitrarily large...)? If you were to do a a push_2048 (0x01).times2048 (example) how many bytes would it be? – Bernardo Varanda Nov 07 '19 at 18:31
  • 1
    Umm, the first is at the beginning. You can decode sequentially. That's how variable length instruction sets such as x86 work. Okay so you do need a fixed size for the length field itself unless you want to go with even fancier encodings. If you allow 32 bit size, your example would be `0x42 0x00 0x00 0x08 0x00 0x01 .... 0x01` which is 1 byte opcode + 4 byte length + 2048 byte constant. – Jester Nov 07 '19 at 18:33
  • 1
    Do you really mean *arbitrarily* large? Can `k` be greater than 256, so `k` itself might need multiple bytes to represent? There are variable-length number encoding schemes you can use for truly arbitrary values but you hopefully don't support that for immediate push. – Peter Cordes Nov 07 '19 at 18:44
  • @PeterCordes Yes, it is arbitrarily large. Although, my interest is only for theoretical reasons. I am not going to implement anything, just use it in a proof. Thank you for your advice on variable-length encodings, i will check them. – Bernardo Varanda Nov 08 '19 at 14:41
  • Erik's answer is a good example of that sort of thing. The other major option (instead of 1 bit per element means more elements) is for one value to be a sentinel / escape code that means a larger value follows. But that probably works best for non-arbitrary sizes. – Peter Cordes Nov 08 '19 at 14:45

1 Answers1

3

There are many approaches to encoding variable length numbers:

One simple approach is a variable length byte approach, which uses one bit of each byte to tell if another byte of the value follows.  For example, in each byte, use the high bit = 1 to say that there is at least one more byte — and use the high bit = 0 to say that this byte is the last byte.

The decoding, here into int type, goes like this:

int decodeVariableLengthNumber ( unsigned char *pc ) {
    int value = 0; // size here should accomodate the largest possible
    for(;;) {
       unsigned byte b = *pc++;
       value <<= 7;
       value |= b & 127;
       if ( b & 128 )
           continue;
       return value;
    }
}

And the encoding is simply the reverse as needed for this decoding.

This can encode 7 data/content bits in every 8 bit byte, so numbers from 0 to 127 in one byte, and 0 to 16383 in two bytes, for example.

Offsets can also be used so that the first byte encodes, 1..128 instead of 0..127 if that makes more sense for the application.  Also, the two byte format might assume one would not use the 2 byte encoding for a number that fits in the 1 byte form, so these might be offset by 127 (or 128), for a slightly larger reach.

You can encode k this way, and then have fixed size k bytes for x — or you can simply encode x itself directly in this variable length manner without necessarily a need for encoding a k for byte count.


Other food for thought to consider:

  • self synchronizing encodings, which could even have some security advantages in instruction set architecture.
  • UTF-8 A variable length encoding that has the self synchronization property.
  • RISC V RISC V supports variable length instructions that are multiples of 2 bytes.
Erik Eidt
  • 23,049
  • 2
  • 29
  • 53
  • Thank you, your suggestion seems very practical and I will see if i can adapt it to my problem. Also, I am going to look up the other encodings. – Bernardo Varanda Nov 08 '19 at 14:52