I'm a newbie to the bigstatsr package. I have an SQLite database that I want to convert to an FBM matrix of 40k rows (genes) by 60k columns (samples) for later consumption. I found examples of how to populate the matrix with random values, but I'm not sure what the best way would be to populate it with values from my SQLite database.
Currently I do it sequentially; here's some mock code:
library(bigstatsr)
library(RSQLite)
library(dplyr)
number_genes <- 50e3
number_samples <- 70e3
large_genomic_matrix <- bigstatsr::FBM(nrow = number_genes,
                                       ncol = number_samples,
                                       type = "double",
                                       backingfile = "fbm_large_genomic_matrix")
# Code to get a single df at a time (one per sample)
database_connection <- dbConnect(RSQLite::SQLite(), "database.sqlite")
sample_index_counter <- 1
for (current_sample in vector_with_sample_names) {
  sqlite_df <- dplyr::tbl(database_connection, "genomic_data") %>%
    dplyr::filter(sample == current_sample) %>%
    dplyr::collect()
  large_genomic_matrix[, sample_index_counter] <- sqlite_df$value
  sample_index_counter <- sample_index_counter + 1
}
big_write(large_genomic_matrix, "large_genomic_matrix.out", every_nrow = 1000, progress = interactive())
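What I was hoping I could do instead is skip big_write and simply reattach the backing file in a later session, roughly like this (the $save() / big_attach() usage is my guess from skimming the bigstatsr docs, so it may well be wrong):

# Guess: persist the FBM's metadata next to the .bk backing file ...
large_genomic_matrix$save()    # should create fbm_large_genomic_matrix.rds

# ... and then, in a later R session, reattach it without copying any data
large_genomic_matrix <- bigstatsr::big_attach("fbm_large_genomic_matrix.rds")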
I have two questions:
- Is there a way to populate the matrix more efficiently? I'm not sure whether big_apply could be used here, or perhaps foreach (see the rough sketch after this list).
- Do I always have to use big_write in order to load my matrix later? If so, why can't I just use the .bk backing file directly, e.g. by reattaching it as in the snippet above?
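To make the first question more concrete, this is the kind of parallel fill I had in mind with foreach. It is only a sketch: the doParallel cluster setup, the number of workers, and my assumption that several workers can safely write to different columns of the same FBM are all guesses on my part, not something I found in the bigstatsr docs.

library(bigstatsr)
library(dplyr)
library(foreach)
library(doParallel)

cl <- parallel::makeCluster(4)  # number of workers is arbitrary here
doParallel::registerDoParallel(cl)

foreach(sample_index = seq_along(vector_with_sample_names),
        .packages = c("bigstatsr", "dplyr")) %dopar% {
  # each worker opens its own SQLite connection; connections can't be shared
  worker_connection <- DBI::dbConnect(RSQLite::SQLite(), "database.sqlite")
  on.exit(DBI::dbDisconnect(worker_connection), add = TRUE)

  current_sample <- vector_with_sample_names[sample_index]
  sqlite_df <- dplyr::tbl(worker_connection, "genomic_data") %>%
    dplyr::filter(sample == current_sample) %>%
    dplyr::collect()

  # each worker fills a distinct column of the shared file-backed matrix
  large_genomic_matrix[, sample_index] <- sqlite_df$value
  NULL  # return nothing so the collected data frames aren't sent back
}

parallel::stopCluster(cl)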
Thanks in advance