Hex string to integer conversion in Amazon Redshift

Question

Amazon Redshift is based on ParAccel which is based on Postgres. From my research it seems that the preferred way to perform hexadecimal string to integer conversion in Postgres is via a bit field, as outlined in this answer.

In the case of bigint, this would be:

select ('x'||lpad('123456789abcdef',16,'0'))::bit(64)::bigint

Unfortunately, this fails on Redshift with:

ERROR: cannot cast type text to bit [SQL State=42846]

What other ways are there to perform this conversion in Postgres 8.1ish (that's close to the Redshift level of compatibility)? UDFs are not supported in Redshift and neither are array, regex functions or set generating functions...

Thanks for clearly stating you're using Redshift and saving us the "what're you really using" shuffle. Now, if only Amazon would donate a Redshift instance to http://sqlfiddle.com/ so we could actually *test stuff on it*. I suspect the best answer to this, as with most data-wrangling on Redshift, is going to be "do it in the client". — Craig Ringer, Jan 02 '14 at 17:49
@CraigRinger I will ping Amazon about donating an instance to SQLFiddle. — Sim, Jan 02 '14 at 20:59
@CraigRinger See http://twitter.com/rahulpathak/status/420320138986913792 — Sim, Jan 06 '14 at 23:44

score 11 · Accepted Answer · edited Jun 20 '20 at 09:12

11

It looks like they added a function for this at some point: STRTOL

Syntax

STRTOL(num_string, base)

Return type

BIGINT. If num_string is null, returns NULL.

For example

SELECT strtol('deadbeef', 16);

Returns: 3735928559

edited Jun 20 '20 at 09:12

Community

1
1

answered Feb 27 '14 at 23:11

Dean

8,632
6
45
61

Good to see them moving fast. – Sim Mar 01 '14 at 01:31

Craig Ringer · Answer 2 · 2014-01-03T01:55:59.513

Assuming that you want a simple digit-by-digit ordinal position conversion (i.e. you're not worried about two's compliment negatives, etc) I think this should work on an 8.1-equivalent DB:

CREATE OR REPLACE FUNCTION hex2dec(text) RETURNS bigint AS $$
SELECT sum(CASE WHEN v >= ascii('a') THEN v - ascii('a') + 10 ELSE v - ascii('0') END * 16^ordpos)::bigint
FROM (
    SELECT n-1, ascii(substring(reverse($1), n, 1))
    FROM generate_series(1, length($1)) n
) AS x(ordpos, v);
$$ LANGUAGE sql IMMUTABLE;

The function form is optional, it just makes it easier to avoid repeating the argument a bunch of times. It should get inlined anyway. Efficiency will probably be awful, but most of the tools available to do this smarter don't seem to be available on versions that old, and this at least works:

regress=> CREATE TABLE t AS VALUES ('c13b'), ('a'), ('f');
regress=> SELECT hex2dec(column1) FROM t;
 hex2dec 
---------
   49467
      10
      15
(3 rows)

If you can use regexp_split_to_array and generate_subscripts it might be faster. Or slower. I haven't tried. Another possible trick is to use a digit mapping array instead of the CASE, like:

'[48:102]={0,1,2,3,4,5,6,7,8,9,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,10,11,12,13,14,15}'::integer[]

which you can use with:

CREATE OR REPLACE FUNCTION hex2dec(text) RETURNS bigint AS $$
SELECT sum(
  ('[48:102]={0,1,2,3,4,5,6,7,8,9,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,10,11,12,13,14,15}'::integer[])[ v ]
  * 16^ordpos
)::bigint
FROM (
    SELECT n-1, ascii(substring(reverse($1), n, 1))
    FROM generate_series(1, length($1)) n
) AS x(ordpos, v);
$$ LANGUAGE sql IMMUTABLE;

Personally, I'd do it client-side instead, rather than wrangling the limited capabilities of an old PostgreSQL fork, especially one you can't load your own sensible user-defined C functions on, or use PL/Perl, etc.

In real PostgreSQL I'd just use this:

hex2dec.c:

#include "postgres.h"
#include "fmgr.h"
#include "utils/builtins.h"
#include "errno.h"
#include "limits.h"
#include <stdlib.h>

PG_MODULE_MAGIC;

Datum from_hex(PG_FUNCTION_ARGS);

PG_FUNCTION_INFO_V1(hex2dec);

Datum
hex2dec(PG_FUNCTION_ARGS)
{
    char *endpos;
    const char *hexstr = text_to_cstring(PG_GETARG_TEXT_PP(0));
    long decval = strtol(hexstr, &endpos, 16);
    if (endpos[0] != '\0')
    {
        ereport(ERROR, (ERRCODE_INVALID_PARAMETER_VALUE, errmsg("Could not decode input string %s as hex", hexstr)));
    }
    if (decval == LONG_MAX && errno == ERANGE)
    {
        ereport(ERROR, (ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE, errmsg("Input hex string %s overflows int64", hexstr)));
    }
    PG_RETURN_INT64(decval);
}

Makefile:

MODULES = hex2dec
DATA = hex2dec--1.0.sql
EXTENSION = hex2dec

PG_CONFIG = pg_config
PGXS := $(shell $(PG_CONFIG) --pgxs)
include $(PGXS)

hex2dec.control:

comment = 'Utility function to convert hex strings to decimal'
default_version = '1.0'
module_pathname = '$libdir/hex2dec'
relocatable = true

hex2dec--1.0.sql:

CREATE OR REPLACE FUNCTION hex2dec(hexstr text) RETURNS bigint
        AS 'hex2dec','hex2dec'
        LANGUAGE c IMMUTABLE STRICT;

COMMENT ON FUNCTION hex2dec(hexstr text)
IS 'Decode the hex string passed, which may optionally have a leading 0x, as a bigint. Does not attempt to consider negative hex values.';

Usage:

CREATE EXTENSION hex2dec;

postgres=# SELECT hex2dec('7fffffffffffffff');
       hex2dec       
---------------------
 9223372036854775807
(1 row)

postgres=# SELECT hex2dec('deadbeef');
  hex2dec   
------------
 3735928559
(1 row)

postgres=# SELECT hex2dec('12345');
 hex2dec 
---------
   74565
(1 row)

postgres=# select hex2dec(to_hex(-1));
  hex2dec   
------------
 4294967295
(1 row)

postgres=# SELECT hex2dec('8fffffffffffffff');
ERROR:  Input hex string 8fffffffffffffff overflows int64

postgres=# SELECT hex2dec('0x7abcz123');
ERROR:  Could not decode input string 0x7abcz123 as hex

The performance difference is ... noteworthy. Given sample data:

CREATE TABLE randhex AS 
SELECT '0x'||to_hex( abs(random() * (10^((random()-.5)*10)) * 10000000)::bigint) AS h
FROM generate_series(1,1000000);

conversion from hex to decimal takes about 1.3 from a warm cache using the C extension, which isn't great for a million rows. Reading them without any transformation takes 0.95s. It took 36 seconds for the SQL based hex2dec approach to process the same rows. Frankly I'm really impressed that the SQL approach was as fast as that, and surprised the C ext was that slow.

UDFs are not supported in Redshift and neither are array, regex functions or set generating functions... Any other ideas? — Sim, Jan 02 '14 at 20:56
@Sim Use the client side, sounds like the only option if Redshift can't do anything useful. Can't really fiddle with it any more unless you can offer access to a Redshift account for ongoing testing, since it clearly has all sorts of restrictions on top of just being based on an ancient PostgreSQL. — Craig Ringer, Jan 03 '14 at 01:56

score 1 · Answer 3 · edited May 23 '17 at 11:59

A likely explanation is that the cast from text to bit(n) relies on undocumented behavior, I repeat the quote from Tom Lane:

This is relying on some undocumented behavior of the bit-type input converter, but I see no reason to expect that would break. A possibly bigger issue is that it requires PG >= 8.3 since there wasn't a text to bit cast before that.

And Amazon derivate is obviously not allowing this undocumented feature. Not surprising, since it is based off of Postgres 8.1 where there was no cast at all.

Previously quoted in this closely related answer:
Convert hex in text representation to decimal number

Thanks for clarifying the reason for the behavior. Do you have ideas on how to work around the issue? — Sim, Jan 02 '14 at 20:53

Hex string to integer conversion in Amazon Redshift

3 Answers3

Linked