0

I have 1500 files with the same format (the .scount file format from PLINK2 https://www.cog-genomics.org/plink/2.0/formats#scount), an example is below:

#IID    HOM_REF_CT  HOM_ALT_SNP_CT  HET_SNP_CT  DIPLOID_TRANSITION_CT   DIPLOID_TRANSVERSION_CT DIPLOID_NONSNP_NONSYMBOLIC_CT   DIPLOID_SINGLETON_CT    HAP_REF_INCL_FEMALE_Y_CT    HAP_ALT_INCL_FEMALE_Y_CT    MISSING_INCL_FEMALE_Y_CT
LP5987245   10  0   6   53  0   52  0   67  70  32
LP098324    34  51  10  37  100 12  59  11  49  0
LP908325    0   45  39  54  68  48  51  58  31  2
LP0932325   7   72  0   2   92  64  13  52  0   100
LP08324 92  93  95  39  23  0   27  75  49  14
LP034252    85  46  10  69  20  8   80  81  94  23

In reality each file has 80000 IIDs and is roughly 1-10MB in size. Each IID is unique and found once per file.

I would like to create a single file matched by IID with each column value summed. The column names are the same across files.

I have tried:

fnames <- list.files(pattern = "\\.scount")
df_list <- lapply(fnames, read.table, header = TRUE)
df_all <- do.call(rbind, df_list)
x <- aggregate(IID ~ , data = df_all, sum)

But this is really slow for the number of files and the # at the start of the #IID column is a real pain to work around.

Any help would be greatly appreciated

tacrolimus
  • 500
  • 2
  • 12

2 Answers2

1

a tidyverse solution

df2 <- df
df3 <- df

df_list <- list(df,df2,df3)

df_all <- do.call(rbind, df_list)

library(dplyr)

df_all %>%
group_by(IID) %>%
summarise_all(sum)

solution with data.table

df_list <- list(df,df2,df3)

df_all <- do.call(rbind, df_list)

library(data.table)

setDT(df_all)
df_all[, lapply(.SD, sum), by=IID]

to ignore '#' see Cannot read file with "#" and space using read.table or read.csv in R

Samet Sökel
  • 2,515
  • 6
  • 21
1

When working with a lot of files, parallel reading might be faster

library(data.table)
library(parallel)

# Get locations of files to read
fnames <- list.files(pattern = "\\.scount", full.names = TRUE)
# Get column names from first file (assuming all files have the same number of columns)
cnames <- as.vector(t(data.table::fread(fnames[1], nrows = 1, header = FALSE)))

# Initiate cluster
# Calculate the number of cores available (or hard code)
no_cores <- parallel::detectCores()
# Initiate cluster
cl <- parallel::makeCluster(no_cores)

# Read in files parallel, rowbind
DT <- data.table::rbindlist( 
  parallel::parLapply(cl, fnames, data.table::fread(skip = 1, header = FALSE )),
  use.names = TRUE, fill = TRUE)
# Stop cluster
parallel::stopCluster(cl)

# Set column names
data.table::setnames(DT, old = names(DT), new = cnames)
Wimpel
  • 26,031
  • 1
  • 20
  • 37
  • thanks for taking the time to answer. After the cnames function I get: Error in data.table::fread(text, nrows = 1, header = FALSE) : input = must be a single character string containing a file name, a system command containing at least one space......... – tacrolimus Sep 07 '21 at 13:00
  • 1
    apologies.. is fixed, see code above... It should get the headers from the first line in the first file... `text` was my testdata ;-) – Wimpel Sep 07 '21 at 13:10
  • thanks for correcting the code! Works really well. – tacrolimus Sep 07 '21 at 13:20