1

I need to perform bitwise operations on tibble columns containing strings of bits. For example, I would like to be able to do something like

ds <- tibble(Id=1:2, X1=c("001", "011"), X2=c("101", "110"))
mutate(ds, X1andX2 = magic.AND(X1,X2))

to obtain

# A tibble: 2 x 4
      Id    X1    X2   X1andX2
    <int> <chr> <chr>   <chr>
1     1    001   101     001
2     2    011   110     010

I am operating on the data sets that are not particularly large (~50k rows) but I have to perform this operation many times. So, I'm looking for something more or less efficient and simple.

Since I have to run many join and group operations, I would prefer an approach compatible with dplyr.

Edit: Sorry, the example above is not very good as three-bit strings produce results that look like three-bit strings after casting to integers and padding with 0s (see Sotos's answer that almost works). Also, it would be nice to see a solution for long strings, i.e. more than 32 bits. Here is a better example.

ds <- tibble(Id=1:2, X1=c("0101", "1110"), X2=c("1110", "0110"))

The output

# A tibble: 2 x 4
      Id    X1    X2   X1andX2
    <int> <chr> <chr>   <chr>
1     1    0101  1110    0100
2     2    1110  0110    0110
rbrisk
  • 306
  • 1
  • 7
  • 2
    how do `001` & `101` give `001`? – Sotos Jul 26 '16 at 14:24
  • Element-wise, i.e. 0 & 1 => 0, 0 & 0 => 0, 1 & 1 => 1 – rbrisk Jul 26 '16 at 14:35
  • 1
    If you’re after performance the obvious first question is why you store bits in character strings. Store them in bitvectors (= integers) instead, and perform true bit operations. – Konrad Rudolph Jul 26 '16 at 15:11
  • 1
    @Sotos … by the conventional rules of bit arithmetic. – Konrad Rudolph Jul 26 '16 at 15:13
  • @Konrad Rudolph That crossed my mind but I had to keep it somewhat user-friendly. If somebody else opens a data file for a quick look, they need to be able to easily differentiate between say "0101000" (many 0s) and "1110111" (many 1s). – rbrisk Jul 26 '16 at 15:17
  • @Zneshkod'Vadu Well there’s a difference between between processing and output (in fact, the classical definition of electronical data processing distinguishes three stages: input, processing, output): process your data in an efficient representation and then format it more user friendly in the output. – Konrad Rudolph Jul 26 '16 at 15:24
  • @KonradRudolph In this particular case, input is the output from another process. So, I have to convert back and forth. However, I cannot find any R packages that would do that. Perhaps, I should just bite the bullet and just write it in Perl. – rbrisk Jul 27 '16 at 05:10

2 Answers2

1

Package bitops makes these operations easy,

library(bitops)
ds$X1_X2 <- sprintf('%03d', bitAnd(ds$X1, ds$X2))
ds
# A tibble: 2 x 4
#     Id    X1    X2 X1_X2
#  <int> <chr> <chr> <chr>
#1     1   001   101   001
#2     2   011   110   010
Sotos
  • 51,121
  • 6
  • 32
  • 66
  • Level 2: Make this dplyr friendly. – zx8754 Jul 26 '16 at 15:03
  • Sorry, I provided a bad example. This approach works for three-bit strings but it fails with longer strings. Also, I may need to operate on very long strings while BitOps casts everything to 32-bit integers. I expanded the question to be more clear. – rbrisk Jul 26 '16 at 15:05
0

I gave up on a simple solution. Following Konrad Rudolph suggestion, I wrote two conversion functions. The first one was inspired by atesghnagfbvgfr's answer to another question.

intToBitStr <- Vectorize(function(x, bitN) {
    i <- 0
    v <- integer(bitN)
    while(x > 0) {
        v[bitN - i] <- x %% 2
        x <- x %/% 2
        i <- i + 1 
    }
    return(paste0(v, collapse=""))
}, c("x"), USE.NAMES = F)

bitStrToInt <- Vectorize(function(x) {
    v <- rev(as.integer(strsplit(x, "")[[1]]))
    acc <- 0
    for (i in 1:length(v)) {
        acc <- acc + v[i] * 2^(i - 1)
    }
    return(acc)
}, USE.NAMES = F)

Using these two functions, the solution would be something like

mutate(ds, X1Int = bitStrToInt(X1), X2Int = bitStrToInt(X2)) %>%
mutate(X1andX2 = intToBitStr(bitwAnd(X1Int, X2Int), bitN=4)) %>%
select(-X1Int, -X2Int)

It may not be very efficient and I have not tested it yet. If it ends up being too slow, I will just write everything in Perl.

Community
  • 1
  • 1
rbrisk
  • 306
  • 1
  • 7