
I have a data frame that looks like this:

SNP1 01010101000000100000010010001010011001010101
SNP2 01010010101000100000000000000001100001001000
SNP3 01010101000000000000000000000100011111111111

... but that in reality contains ~8 million rows, and each binary vector is of length 1000.

I need to select specific positions in these binary vectors (across all rows). The dirty way I found to do this was to remove the row names, convert each digit into a column, and then create an object containing the positions I am interested in.

The following works fine with sample data, but it is not very efficient with my real data (it has been running for a long time now). Any ideas on how I can make it faster?

library(data.table)
library(stringr)
setwd("test/")
DATADIR="datadir/"
OUTPUTDIR="outputdir/"
dir.create(OUTPUTDIR, showWarnings = FALSE)

baseline <- read.table(paste0(DATADIR, "input.file"), colClasses = "character")

# Pass BP name to row name (so that I can split the binary vector into multiple columns)
row.names(baseline) <- baseline$V1
baseline$V1 <- NULL

# split cells containing the binary vectors into multiple columns - thank you @Onyambu for this!
baseline_new <- read.table(text = gsub('(.)', '\\1 ', baseline$V2), fill = TRUE)

# select columns of interest
columns_to_keep <- c(1, 4, 8, 10)
baseline_new_ss <- baseline_new[, columns_to_keep]

# create new object containing a column with the original row names, then
# recreate the binary vector based on the subsetted binary positions
baseline_final <- as.data.frame(row.names(baseline))
baseline_final$V2 <- as.character(interaction(baseline_new_ss, sep = ""))

Output (selecting only positions 1, 4, 8 and 10) should look like:

SNP1 0110
SNP2 0100
SNP3 0110

I am sure there's a less convoluted way of doing this.

Thank you!!

    Consider it a fixed width file with 1000 columns of width one each, and [use one of these approaches to read it that way directly](https://stackoverflow.com/a/34190156/903061). You've got `library(data.table)` which is a good idea, but you're not using it at all. `data.table::fread` is one of the fastest choices for reading fixed width files at my link, and it will create a `data.table` object which should be the fastest for subsequent operations too. But you'll want to read the intro to data.table vignette to learn how to use it. – Gregor Thomas Jan 09 '20 at 16:05
    That said, depending on your subsequent operations, you may do better to use `iotools`, the fastest fixed width reader at my link, and making it a `matrix` rather than a `data.table`. Or a sparse matrix, if the ones are scarce (from your example, they look pretty common, so stick with a regular matrix.) – Gregor Thomas Jan 09 '20 at 16:11
    It may also be worth looking at `vroom::vroom_fwf()`. Not sure how it would perform in the benchmarks that Gregor linked but it is another (potentially fast) option. [Link for more info on `vroom::vroom_fwf()`](https://vroom.r-lib.org/articles/vroom.html#reading-fixed-width-files) – Andrew Jan 09 '20 at 16:20
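Following up on the comments above, here is a minimal sketch of reading only the positions of interest directly with `vroom::vroom_fwf()`. The file name, the 4-character SNP IDs, and the resulting character offsets are assumptions based on the sample data; if the IDs vary in width, the offsets would need adjusting.

library(vroom)

# bit k sits at character 5 + k (assumed: 4-character ID plus one space)
pos <- c(1, 4, 8, 10)
starts <- 5 + pos

baseline_fwf <- vroom_fwf(
  "datadir/input.file",
  col_positions = fwf_positions(
    start = c(1, starts),
    end   = c(4, starts),            # width-1 columns: start == end
    col_names = c("id", paste0("pos", pos))
  ),
  col_types = cols(.default = "c")   # keep everything as character
)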

3 Answers


You could try this:

at <- function(binary_strings, positions)
{
  # start with one empty string per element of binary_strings,
  # then append the character at each requested position in turn
  charvec <- character(length(binary_strings))
  for(i in seq_along(positions))
  {
    charvec <- paste0(charvec, substr(binary_strings, positions[i], positions[i]))
  }
  return(charvec)
}

Now you can do

at(baseline$`whatever your binary column is called`, c(1, 4, 8, 10))
#> [1] "0110" "0100" "0110"

So with the pipe you can do

library(magrittr)

baseline$`whatever your binary column is called` %<>% at(c(1, 4, 8, 10))

print(baseline)
#>      whatever your binary column is called
#> SNP1                                  0110
#> SNP2                                  0100
#> SNP3                                  0110

I've benchmarked this at 7 seconds on 8 million rows using a pretty slow Windows PC.
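For reference, a rough way to reproduce a benchmark like this on simulated data (the data-generation step and sizes here are my own assumptions, not from the answer):

# build 8 million copies of one random 1000-digit string (real data would
# have varying strings, but the timing should be comparable)
set.seed(42)
big <- rep(paste(sample(0:1, 1000, replace = TRUE), collapse = ""), 8e6)

system.time(res <- at(big, c(1, 4, 8, 10)))
head(res, 3)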

Allan Cameron
  • This chunk of code is amazing, thank you! It worked exactly like I wanted and it's super fast! It is also way less convoluted than my old code (I need to start learning how to create these functions!) – Rodrigo Duarte Jan 09 '20 at 17:31

You could use strsplit, select elements with mapply, and paste it back together into a data frame. I don't know how fast this is, but it is concise :)

`rownames<-`(data.frame(values=
                          mapply(function(x) Reduce(paste0, x[c(1, 4, 8, 10)]), 
                                 sapply(dat$V2, strsplit, ""))),
         dat$V1)
#      values
# SNP1   0110
# SNP2   0100
# SNP3   0110

Maybe there's a data.table solution around that avoids making copies internally and is therefore faster.
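For what it's worth, one possible data.table sketch along those lines (using the `dat` object defined under "Data:" below; the `:=` update happens by reference, so the table itself isn't copied):

library(data.table)

pos <- c(1, 4, 8, 10)
dat[, values := do.call(paste0, lapply(pos, function(i) substr(V2, i, i)))]
dat[, .(V1, values)]
#      V1 values
# 1: SNP1   0110
# 2: SNP2   0100
# 3: SNP3   0110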


Data:

"SNP1 01010101000000100000010010001010011001010101
SNP2 01010010101000100000000000000001100001001000
SNP3 01010101000000000000000000000100011111111111"->tx
dat <- data.table::fread(text=tx, header=F)
jay.sf

Another option is to use stringi:

timing code:

nr <- 1e6
nc <- 1e3
l <- rep(paste(rep(1L, nc), collapse=""), nr)
writeLines(l, "test.txt")

cols <- c(1,4,8,10)

library(stringi)
library(iotools)    
microbenchmark::microbenchmark(times=1L,
    stringi=lapply(cols, function(n) stri_sub(l, n, n)),
    iotools=input.file("test.txt", formatter=dstrfw, 
        col_types=rep("character", nc), widths=rep(1L, nc))[, cols]
)

timings:

Unit: seconds
    expr       min        lq      mean    median        uq       max neval
 stringi  1.329223  1.329223  1.329223  1.329223  1.329223  1.329223     1
 iotools 76.250773 76.250773 76.250773 76.250773 76.250773 76.250773     1
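Note that the stringi timing covers only the extraction of the individual positions; to assemble the final strings from that result, the extracted characters can be pasted together element-wise:

# combine the per-position character vectors into one string per row
res <- do.call(paste0, lapply(cols, function(n) stri_sub(l, n, n)))
head(res, 1)
# [1] "1111"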
chinsoon12