I have 1500 files in the same format (the .scount format from PLINK2, https://www.cog-genomics.org/plink/2.0/formats#scount). An example is below:
#IID HOM_REF_CT HOM_ALT_SNP_CT HET_SNP_CT DIPLOID_TRANSITION_CT DIPLOID_TRANSVERSION_CT DIPLOID_NONSNP_NONSYMBOLIC_CT DIPLOID_SINGLETON_CT HAP_REF_INCL_FEMALE_Y_CT HAP_ALT_INCL_FEMALE_Y_CT MISSING_INCL_FEMALE_Y_CT
LP5987245 10 0 6 53 0 52 0 67 70 32
LP098324 34 51 10 37 100 12 59 11 49 0
LP908325 0 45 39 54 68 48 51 58 31 2
LP0932325 7 72 0 2 92 64 13 52 0 100
LP08324 92 93 95 39 23 0 27 75 49 14
LP034252 85 46 10 69 20 8 80 81 94 23
In reality, each file has 80,000 IIDs and is roughly 1-10 MB in size. Each IID is unique and appears once per file.
I would like to create a single file in which rows are matched by IID and each column's values are summed across all files. The column names are the same across files.
I have tried:
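fnames <- list.files(pattern = "\\.scount")
# comment.char = "" keeps read.table from skipping the "#IID" header line as a comment
df_list <- lapply(fnames, read.table, header = TRUE, comment.char = "", check.names = FALSE)
df_all <- do.call(rbind, df_list)
names(df_all)[1] <- "IID"  # drop the leading "#" from the first column name
x <- aggregate(. ~ IID, data = df_all, FUN = sum)  # sum every count column per IID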
But this is really slow for this number of files, and the # at the start of the #IID column name is a real pain to work around (read.table treats # as a comment character by default).
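To make the goal concrete, here is a minimal sketch of one possible lower-memory approach in base R, offered as an illustration rather than a definitive solution. Instead of rbind-ing all 1500 files before aggregating, it keeps a running total and folds in one file at a time with rowsum(). The helper name sum_scount and the output file name summed.scount are hypothetical. It assumes the first column holds the IID and all remaining columns are numeric counts; IIDs do not need to be in the same order in every file.

```r
# Hypothetical helper: sum all count columns per IID across many .scount files.
sum_scount <- function(fnames) {
  total <- NULL  # numeric matrix, one row per IID, rownames are the IIDs
  for (f in fnames) {
    # comment.char = "" stops read.table from discarding the "#IID" header line;
    # check.names = FALSE keeps the column names exactly as written in the file.
    d <- read.table(f, header = TRUE, comment.char = "", check.names = FALSE,
                    stringsAsFactors = FALSE)
    counts <- as.matrix(d[, -1, drop = FALSE])
    storage.mode(counts) <- "double"  # avoid integer overflow in large sums
    rownames(counts) <- as.character(d[[1]])
    # Stack the running total on the new file, then collapse duplicate IIDs.
    stacked <- rbind(total, counts)
    total <- rowsum(stacked, group = rownames(stacked))
  }
  data.frame(IID = rownames(total), total, row.names = NULL,
             check.names = FALSE, stringsAsFactors = FALSE)
}

# Possible usage (file names are assumptions):
# result <- sum_scount(list.files(pattern = "\\.scount$"))
# write.table(result, "summed.scount", quote = FALSE, row.names = FALSE, sep = "\t")
```

Because only the running total and one file are held in memory at a time, the working set stays at roughly two files' worth of rows rather than 1500.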
Any help would be greatly appreciated.