1

Using the bit64 package, I am trying to create 64-bit integer hashes using xxhash64, similarly to pyspark's pyspark.sql.functions.xxhash64.

With the digest package, I can create the xxhash64 in the form of a string representation of a hex, however I am not able to convert this to an integer64:

str(digest::digest("foo", algo = "xxhash64"))
# chr "26af78b925a80acb"

For some strings, I can use as.numeric, but that has problems with precision and NAs produced by integer64 overflow:

xxHash <- digest::digest("barfoo", algo = "xxhash64")
xxHash
# [1] "8107d13ec8130cad"
bit64::as.integer64(as.numeric(paste0("0x", xxHash)))
# integer64
# [1] <NA>

Is there a way to get xxhash64 values in the integer64 data type?

Jozef
  • 2,617
  • 14
  • 19
  • 1
    Does this answer your question? [How do I convert 64-bit hexadecimal strings in R?](https://stackoverflow.com/questions/26302706/how-do-i-convert-64-bit-hexadecimal-strings-in-r) – Mark Jun 23 '23 at 09:36
  • No, as Davor pointed out: https://stackoverflow.com/questions/26302706/how-do-i-convert-64-bit-hexadecimal-strings-in-r#comment104789265_26304030 – Jozef Jun 23 '23 at 09:37
  • I would recommend using ‘Rmpfr’ instead of ‘bit64’: the latter does not allow natively converting from hexadecimal, it has no unsigned integer type, its implementation codebase is messy and, at a glance, does not appear to be very high quality, and I also just found an arithmetic bug in the library (`bit64::as.integer64(2L) ^ (63L - 4L) != bit64::as.integer64(2L) ^ 63 %/% 16L`). – Konrad Rudolph Jun 23 '23 at 14:02

2 Answers2

3

For hexadecimal values that fit into 63 bits (i.e. strictly positive signed values), you can use the following conversion function:

hex_to_int64 = function (x) {
  hex_digits = setNames(0 : 15, c(0 : 9, letters[1 : 6]))

  Reduce(
    \(value, digit) value * 16L + unname(hex_digits[digit]),
    strsplit(sub('^0x', '', x), '')[[1L]],
    bit64::integer64(1L)
  )
}

However, this won’t work for the value from your example (0x8107d13ec8130cad), since that exceeds 263 − 1 (0x7fffffffffffffff) and thus overflows the signed 64 bit integer range.

Unfortunately ‘bit64’ does not provide an unsigned integer type, nor does it provide a two’s complement function or bit shift, so manually implementing the above function for values ≥ 263 is rather onerous. I suggest using a different package.

The easiest solution I could find that supports values ≥ 263 does a detour via binary representation and then tests whether the 64th digit is 1:

hex_to_int64_signed = function (x) {
  hex_digits = setNames(0 : 15, c(0 : 9, letters[1 : 6]))
  bin_digits = t(expand.grid(0 : 1, 0 : 1, 0 : 1, 0 : 1))

  from_hex = function (x) unname(hex_digits[x])
  to_bin = function (x) rev(bin_digits[, x + 1L])

  complement = function (x) {
    x = 1L - x

    carry = 1L
    for (i in rev(seq_along(x))) {
      res = x[i] + carry
      x[i] = res %% 2L
      carry = res %/% 2L
    }
    x
  }

  # Step 1: convert to binary representation
  binary = unlist(lapply(
    strsplit(sub('^0x', '', x), '')[[1L]],
    \(d) to_bin(from_hex(d))
  ))

  stopifnot(`result does not fit into 64 bits` = length(binary) <= 64L)

  # Step 2: from binary to integer64

  add = 0L
  sign = 1L
  if (length(binary) == 64L && binary[1L] == 1L) {
    sign = -1L
    binary = complement(binary)

    # FIXME: This does not work due to a bug in ‘bit64’
    if (binary[1L] == 1L) {
      add = -bit64::as.integer64(2L) ^ 63L
      sign = 0L
    }
  }

  add + sign * Reduce(\(x, d) x * 2L + d, utils::tail(binary, 63L), bit64::as.integer64(0L))
}

We can now calculate the integer64 value of 0x8107d13ec8130cad:

> hex_to_int64_signed('0x8107d13ec8130cad')
integer64
[1] -9149114050405004115

However, note the FIXME: comment! A bug in ‘bit64’ prevents it from being able to represent the smallest valid 64 bit signed integer, which has the value −263:

> bit64::as.integer64('-9223372036854775807')
integer64
[1] -9223372036854775807

> bit64::as.integer64('-9223372036854775808')
integer64
[1] <NA>

> -bit64::as.integer64(2L) ^ 63L
integer64
[1] -9223372036854775807
#                      ^ THIS IS WRONG!

As a consequence, the hex_to_int64_signed function fails on the entirely valid input 0x8000000000000000, but it works on both sides of it, i.e. 0x7fffffffffffffff (= 9223372036854775807) and 0x8000000000000001 (= −9223372036854775807).

… as I keep saying: use a different package.

Konrad Rudolph
  • 530,221
  • 131
  • 937
  • 1,214
1

The as.numeric is double precision and has only 53 bits, so that it will result in rounding errors in the last [0, 4] digits since 2^9 bits is missing where the possible values it can take is 0 to 512, for more details in base precision as.numeric see (https://stat.ethz.ch/R-manual/R-devel/library/base/html/double.html)

Actually it happens also in your first example of "foo", lets use the as.numeric() and bit64::as.integer64 together;

bit64::as.integer64(as.numeric(paste0("0x", "26af78b925a80acb")))
# 2787579430961678848

Yet, it is not quite correct due to the 53 bit precision. The correct integer64 representation for that hexadecimal is 2787579430961679051. You can get this number using Rmpfr package as follows;

library(Rmpfr)
mpfr(paste0("0x", "26af78b925a80acb"), base=16)
# 2787579430961679051

Same for xxHash;

mpfr(paste0("0x", xxHash), base=16)
# 9297630023304547501
Slybot
  • 588
  • 1
  • 11
  • Thanks, this does work, however I asked specifically for a solution that results in the integer64 data type, because my use case requires that for inter-operability reasons. – Jozef Jun 23 '23 at 11:37
  • 1
    @Jozef Unfortunately you really *can’t* do this meaningfully using ‘bit64’, since ‘bit64’ integers are signed, and the value in your example does not fit into a signed 64 bit integer without overflow (put differently, the result would be negative). – Konrad Rudolph Jun 23 '23 at 13:43