I have a data frame that looks like this:
SNP1 01010101000000100000010010001010011001010101
SNP2 01010010101000100000000000000001100001001000
SNP3 01010101000000000000000000000100011111111111
... but in reality it contains ~8 million rows, and each binary vector has length 1000.
I need to select specific positions in these binary vectors (across all rows). The dirty way I found to do this was to remove the row names, convert each digit into a column, and then create an object containing the positions I am interested in.
The following works fine with sample data, but it is not very efficient with my real data (it has been running for a long time now). Any ideas how I can make it faster?
library(data.table)
library(stringr)
setwd("test/")
DATADIR <- "datadir/"
OUTPUTDIR <- "outputdir/"
dir.create(OUTPUTDIR, showWarnings = FALSE)
# Read everything as character so the leading zeros of the binary vectors are kept
baseline <- read.table(paste0(DATADIR, "input.file"), colClasses = "character")
# Pass BP name to row name (so that I can split the binary vector into multiple columns)
row.names(baseline) <- baseline$V1
baseline$V1 <- NULL
# split cells containing the binary vectors into multiple columns - thank you @Onyambu for this!
baseline_new <- read.table(text = gsub('(.)', '\\1 ', baseline$V2), fill = TRUE)
# select columns of interest
columns_to_keep <- c(1, 4, 8, 10)
baseline_new_ss <- baseline_new[, columns_to_keep]
# create new object containing a column with the original row names, then recreate binary vector based on subsetted binary positions.
baseline_final <- data.frame(V1 = row.names(baseline))
baseline_final$V2 <- as.character(interaction(baseline_new_ss, sep = ""))
Output (selecting only positions 1, 4, 8 and 10) should look like:
SNP1 0110
SNP2 0100
SNP3 0110
I am sure there's a less convoluted way of doing this.
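For instance, the selection could probably be done directly on the strings, without splitting every digit into its own column. Here is a minimal sketch on the three sample rows above, assuming the data are held in a named character vector; the names `snps`, `positions`, `digits`, `selected` and `result` are made up for illustration, and this has not been timed on the full 8 million rows:

```r
# Toy input copied from the sample rows; `snps` is a hypothetical name.
snps <- c(
  SNP1 = "01010101000000100000010010001010011001010101",
  SNP2 = "01010010101000100000000000000001100001001000",
  SNP3 = "01010101000000000000000000000100011111111111"
)
positions <- c(1, 4, 8, 10)

# substr() is vectorised over the strings, so each call extracts one
# position from every row at once. The loop runs over the handful of
# positions, not over the rows.
digits <- lapply(positions, function(p) substr(snps, p, p))

# Paste the extracted digits back together, row by row.
selected <- do.call(paste0, digits)

result <- data.frame(V1 = names(snps), V2 = selected, stringsAsFactors = FALSE)
print(result)
```

On the sample rows this gives 0110, 0100 and 0110, matching the expected output above.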
Thank you!!