1

I have a table of user entries, and for every entry I have an array of (2-byte) integers to store (15-25, sporadically even more). The array elements will be written and read all at the same time, it is never needed to update or to access them individually. Their order matters. It makes sense to think of this as an array object.

I have many millions of these user entries and want to store this with the minimum possible amount of disk space. I'm however struggling with MySQL's lack of Array datatype.

I've been considering the following options.

  • Do it the MySQL way. Make a table my_data with columns user_id, data_id and data_int. To make this efficient, one needs an index on user_id, totalling well over 10 bytes per integer.
  • Store the array in text format. This takes ~6.5 bytes per integer.
  • making 35-40 columns ("enough") and having -32768 be 'empty' (since this value cannot occur in my data). This takes 3.5-4 bytes per integer, but is somewhat ugly (as I have to impose a strict limit on the number of elements in the array).

Is there a better way to do this in MySQL? I know MySQL has an efficient varchar type, so ideally I'd store my 2-byte integers as 2-byte chars in a varchar (or a similar approach with blob), but I'm not sure how to do that. Is this possible? How should this be done?

user1111929
  • 6,050
  • 9
  • 43
  • 73
  • 1
    You could try the `varbinary` type. – Turbo J Apr 21 '16 at 23:47
  • Thanks, looks promising. How should I safely convert an array of 2-byte integers (in PHP) from/to binary, since arrays in PHP are not typed? Or should I let MySQL do that, and how? – user1111929 Apr 21 '16 at 23:50
  • 3
    Are you using a client language like PHP or C to access the database? You can use PHP `pack()` to convert an array into a binary string, which you then store in the `VARBINARY` column. – Barmar Apr 22 '16 at 00:05
  • Yes, but how can I tell `pack()` that the array consists of 2-byte integers? By default, `pack()` packs the PHP array, which consists of keys mapping to objects which happen to be integers; costs waaaay more than 2 bytes per entry (and would probably be the worst or second worst in the list if carried out in that way). – user1111929 Apr 22 '16 at 00:12
  • 1
    @user1111929 The first argument to `pack` is a format string, that's where you indicate the types. – Barmar Apr 22 '16 at 00:23
  • Aha! I was misusing pack by entering an array, when I do `echo bin2hex(pack("s*",-2322,253,2,0,3253,-22222,-11111));` I can see that it indeed produces a string of the correct length. That works, thanks! – user1111929 Apr 22 '16 at 00:33
  • 1
    You can use `call_user_func_array()` to expand an array to arguments. – Barmar Apr 22 '16 at 00:37
  • Will there **ever** be a need to search or do computations on the integer "array" data? – wallyk Apr 22 '16 at 00:44
  • No, it's storage only (write once, read a few times). Searching/computing will only happen offline on my laptop, when making statistics out of them, that's how rare it is. – user1111929 Apr 22 '16 at 00:46

2 Answers2

0

You could store them as separate SMALLINT NULL columns.

In MyISAM this this uses 2 bytes of data + 1 bit of null indicator for each value.

In InnoDB, the null indicators are encoded into the column's field start offset, so they don't take any extra space, and null values are not actually stored in the row data. If the rows are small enough that all the offsets are 1 byte, then this uses 3 bytes for every existing value (1 byte offset, 2 bytes data), and 1 byte for every nonexistent value.

Either of these would be better than using INT with a special value to indicate that it doesn't exist, since that would be 4 bytes of data for every value.

See NULL in MySQL (Performance & Storage)

Community
  • 1
  • 1
Barmar
  • 741,623
  • 53
  • 500
  • 612
  • 2-byte integers can be stored in a SMALLINT, so only requiring 2 bytes per entry. The problem is that this forces me to hard-limit the number of variables in the array. So I'm afraid your suggestion is just a slightly suboptimal version of my option #3. – user1111929 Apr 22 '16 at 00:19
  • But you also need the `NULL` indicator to indicate that the value doesn't exist in that array element. That's an extra bit – Barmar Apr 22 '16 at 00:20
  • No, -32768 would mean it doesn't exist. That value can never occur (entries are limited between 30k and -30k) so I can safely use it as a replacement for the null bit. And given that more than 50% of the values is not null, half the entries taking 3 bytes and the other half 1 byte is slightly worse than all of them taking 2 bytes. – user1111929 Apr 22 '16 at 00:23
  • In InnoDB, all the entries will take 3 bytes if none of them are null. Every column has 1 byte for the field start offset, and the non-null columns have 2 bytes of data. – Barmar Apr 22 '16 at 00:26
  • Even if null is not allowed? – user1111929 Apr 22 '16 at 00:35
  • The question doesn't mention that the range of meaningful values is not the full range of 2 byte integers. You should add that, since the whole reason I suggested using `NULL` was because I thought you needed to be able to represent them all. – Barmar Apr 22 '16 at 00:36
  • True, I'll add it. – user1111929 Apr 22 '16 at 00:38
0

The best answer was given in the comments, so I'll repost it here with some use-ready code, for further reference.

MySQL has a varbinary type that works really well for this: you can simply use PHP's pack/unpack functions to convert them to and from binary form, and store that binary form in the database using varbinary. Example code for the conversion is below.

function pack24bit($n) { //input: 24-bit integer, output: binary string of length 3 bytes
    $b3 = $n%256;
    $b2 = $n/256;
    $b1 = $b2/256;
    $b2 = $b2%256;
    return pack('CCC',$b1,$b2,$b3);
}

function unpack24bit($packed) { //input: binary string of 3 bytes long, output: 24-bit int
    $arr = unpack('C3b',$packed);
    return 256*(256*$arr['b1']+$arr['b2'])+$arr['b3'];
}
user1111929
  • 6,050
  • 9
  • 43
  • 73