
I need help with a task. I have a folder of 121 .txt files, each about 10 MB in size. The files have almost exactly the same columns/headers, and varying numbers of rows. I only discovered the difference in column headers yesterday; it probably comes from the machine that generates the .txt files using lots of special characters in the headers, so funny business happens when I read them in.

I would like to read all the files in the folder and combine them into one big file for downstream analysis. Two problems have come up: the size of the files and the potential dimension inconsistency made my fread() code fail. First, I'd like a function that can properly read in a large number of .txt files. Second, I want to randomly sample, say, 20% of each file after reading it in, and merge those 20% samples into one .csv for downstream processing. I'm fairly new to R, and list operations have been conceptually challenging so far. In the end, rbind did not work, since some of the file dimensions are inconsistent; I used smartbind from gtools to get around that. Finally, similar to sampling rows before creating a massive file, can I also subset columns 1 to 131 of each file as it is read in?
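For reference, a minimal sketch of how the header mismatch could be diagnosed before merging (assuming tab-separated files in the working directory and that data.table is installed; file names here are whatever list.files finds):

```r
library(data.table)

# Sketch: read only the header row of each file and compare column names,
# to see which files the special characters actually affect.
txt_files <- list.files(pattern = "\\.txt$")
headers <- lapply(txt_files, function(f) names(fread(f, nrows = 0)))

# make.names() turns special characters into "." so names become comparable
clean_headers <- lapply(headers, make.names)

# Flag files whose cleaned header differs from the first file's header
if (length(clean_headers) > 1) {
  ref <- clean_headers[[1]]
  mismatched <- txt_files[!vapply(clean_headers, identical, logical(1), y = ref)]
  print(mismatched)
}
```

Files listed in `mismatched` are the ones worth opening by hand to see what the instrument put in the header.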

Here is my code, which slowly reads in all the files and combines them into one big .csv. Please educate me.

setwd("C:/Users/mli/Desktop/3S_DMSO")
library(gtools)
# Create list of text files
txt_files_ls = list.files(pattern = "\\.txt$")
# Read the files in, assuming tab separator
txt_files_df <- lapply(txt_files_ls, function(x) {read.csv(file = x, header = T, sep ="\t")})
# Combine them
combined_df <- do.call("smartbind", lapply(txt_files_df, as.data.frame))

write.csv(combined_df,"3SDMSO_merged.csv",row.names = F)
ML33M
2 Answers


You might try using the read and write functions from data.table. fread automatically detects the separator, header, and column types, which makes it robust for reading many slightly inconsistent files.

library(data.table)
setwd("C:/Users/mli/Desktop/3S_DMSO")
txt_files_ls = list.files(pattern = "\\.txt$")
txt_files_df <- lapply(txt_files_ls, fread)
sampled_txt_files_df <- lapply(txt_files_df,function(x){
  x[sample(1:nrow(x), ceiling(nrow(x) * 0.2)),1:131]
  })
combined_df <- rbindlist(sampled_txt_files_df)
fwrite(combined_df,"3SDMSO_merged.csv",row.names = FALSE)
Ian Campbell
  • Thank you @Ian Campbell. The code ran smoothly; my fread did not run yesterday most likely because I did not load the library. I do get a warning: "Warning message: In sample.int(length(x), size, replace, prob) : '.Random.seed[1]' is not a valid integer, so ignored" but the output file looks okay. Should I be worried? – ML33M Jun 16 '20 at 18:15
    I don't think I'd worry about the error, I ran some tests and I seemed to get random enough results. – Ian Campbell Jun 16 '20 at 18:19
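That warning usually means an invalid .Random.seed object is sitting in the global environment (for example, restored from a saved workspace). A minimal sketch of clearing it and making the 20% sample reproducible; the seed value 1234 is an arbitrary example:

```r
# Remove a stale/corrupt saved seed, if one exists, then set a fresh one
if (exists(".Random.seed", envir = globalenv()))
  rm(".Random.seed", envir = globalenv())
set.seed(1234)
s1 <- sample(100, 20)
set.seed(1234)
s2 <- sample(100, 20)
identical(s1, s2)  # TRUE: sampling is now reproducible across runs
```

Calling set.seed() before the lapply that does the sampling makes the merged .csv reproducible as well.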
...
txt_files_df <- lapply(txt_files_ls, function(x) {
  # fread with fill=TRUE usually works; if not, go back to read.csv
  fread(file = x, header = TRUE, sep = "\t", fill = TRUE)[sample(.N, round(.2 * .N))] # keep a random 20% of rows
})
# rbindlist with use.names=T,fill=T usually works. if not, preprocess above or go back to smartbind
combined_df <- rbindlist(txt_files_df,use.names=T,fill=T)
## Keep only columns 1 - 131
# if you don't use fread, then convert to data.table so the column selection below works:
# setDT(combined_df)
combined_df = combined_df[,1:131]
...

Need it faster? see https://stackoverflow.com/a/58131427/1563960
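As a small self-contained illustration (toy tables, not the real files) of why use.names=TRUE, fill=TRUE handles the inconsistent dimensions that broke plain rbind():

```r
library(data.table)

# Two toy tables with different column order and an extra column in one
a <- data.table(id = 1:2, x = c("a", "b"))
b <- data.table(x = "c", id = 3L, extra = TRUE)

# Columns are matched by name, and 'extra' is filled with NA for rows from 'a'
combined <- rbindlist(list(a, b), use.names = TRUE, fill = TRUE)
```

Plain rbind() would fail here because the tables have different numbers of columns; rbindlist aligns by name and pads the missing cells with NA.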

webb