Loop through a large number of data files in a folder to subset each data based on column names

Question

I have several .txt data files saved in a directory. The files have different numbers of rows, and each contains several columns. Every file has an "id" column, but the remaining column names differ from file to file. As a small example, consider df1 and df2 as two of the data files in the directory:

df1<-
structure(
list(id = c(1L, 2L, 3L, 4L),
a1=c(10L, 6L, 2L, 8L),
a2 = c(22L, 7L, 5L, 1L),
a3 = c(3L, 12L, 1L, 5L)),
.Names = c("id", "a1", "a2","a3"),
class = "data.frame",
row.names = c(NA,-4L))

df2<-structure(
list(id = c(1L, 2L, 3L),
b1=c(8L, 5L, 4L),
b2 = c(7L, 10L, 11L),
b3 = c(6L, 2L, 1L)),
.Names = c("id", "b1", "b2","b3"),
class = "data.frame",
row.names = c(NA,-3L))

What I intend to do is subset each data set by selected column names, say "a1" and "a2" for df1 and "b1" and "b2" for df2.

I tried the following code:

setwd(".../")
df1<-read.table("df1.txt", header=TRUE)
df2<-read.table("df2.txt", header=TRUE)

new.df1<-data.frame(df1$a1,df1$a2)
new.df2<-data.frame(df2$b1,df2$b2)

My concern is that this approach is inefficient: there are many data files, each with many variables, so I would have to repeat the lines above many times. Is there a way to loop through the directory and subset each file by its relevant column names? Your help is greatly appreciated.

How does one know what are the relevant column names for each datafile? — s_baldur, Sep 02 '19 at 13:51
You can `select` or `drop` columns directly while reading files with `fread` from `data.table` package. It could be a start. See https://stackoverflow.com/questions/5788117/only-read-selected-columns/5788200 — Paul Endymion, Sep 02 '19 at 13:52
@sindri_baldur the relevant column names in the example are "a1" and "a2" for df1 and "b1" and "b2" for df2. — T Richard, Sep 02 '19 at 19:22

Paul Endymion · Answer 1 · 2019-09-02T15:14:40.427


From what I understand of your question, this is how I would try to do it. Note that it only works if the wanted columns are always at the same index or have the same names in all your tables.

library(data.table)

# recover file names, with their paths, so fread can find them from any working directory
list_file <- list.files("path_to_your_files", full.names = TRUE)

# loop over your files, recover only selected columns
list_df <- lapply(list_file, function(x){

  #If your column names are always the same
  fread(x, select = c("a1","a2"))

  #If your column names are always in the same order
  #fread(x, select = c(1,2))

})

The result is a list containing the subset of each of your tables.
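Since the approach above depends on knowing which columns each file holds, it may help to list every file's header first. The following is only a sketch in base R: the folder `demo_dir` and the two files written into it are hypothetical stand-ins modelled on df1 and df2 from the question, not the asker's real data.

```r
# Hypothetical demo folder holding two small files shaped like df1 and df2
demo_dir <- file.path(tempdir(), "header_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.table(data.frame(id = 1:4, a1 = c(10L, 6L, 2L, 8L),
                       a2 = c(22L, 7L, 5L, 1L), a3 = c(3L, 12L, 1L, 5L)),
            file.path(demo_dir, "df1.txt"), row.names = FALSE)
write.table(data.frame(id = 1:3, b1 = c(8L, 5L, 4L),
                       b2 = c(7L, 10L, 11L), b3 = c(6L, 2L, 1L)),
            file.path(demo_dir, "df2.txt"), row.names = FALSE)

# Full paths, so reading works regardless of the working directory
files <- list.files(demo_dir, pattern = "\\.txt$", full.names = TRUE)

# Read a single data row per file, which is enough to get the column names
headers <- lapply(files, function(f) {
  names(read.table(f, header = TRUE, nrows = 1))
})
names(headers) <- basename(files)

headers
```

With the headers in one named list, it becomes easier to see whether the wanted columns can be described by a shared name, a position, or a pattern.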

edited Sep 02 '19 at 15:14

answered Sep 02 '19 at 15:07

Paul Endymion


Thanks @Paul Endymion! In my case the column names vary across the data files, and the column positions differ for some of them. I followed the link you provided and read further about `fread`, but my efforts have been futile. Could you please advise, or give any further suggestions? – T Richard Sep 03 '19 at 01:40

Another slightly different approach would be to create a list of vectors that match the column names in each file, such as this line that uses your example: `lapply(letters, function(x){ c(paste(x,"1", sep = ""), paste(x,"2", sep = "")) })`. Provided this list of column-name pairs follows the same order as your list of file names, you could then loop over the indexes and write the `fread` call like this: `fread(list_file[[x]], select = colnames_list[[x]] )`. Do you have a way to easily build that list of column names? Is there some rule that determines them? – Paul Endymion Sep 03 '19 at 12:00
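The suggestion in this comment could be sketched as follows. This is a minimal illustration, not a tested solution for the real data: the folder `demo_dir` and the files in it are hypothetical, and it assumes that the n-th file (in the order returned by `list.files`) uses the n-th letter of the alphabet as its column prefix, as in the question's example.

```r
library(data.table)

# Hypothetical demo folder with two files shaped like df1 and df2
demo_dir <- file.path(tempdir(), "select_demo")
dir.create(demo_dir, showWarnings = FALSE)
fwrite(data.table(id = 1:4, a1 = c(10L, 6L, 2L, 8L),
                  a2 = c(22L, 7L, 5L, 1L), a3 = c(3L, 12L, 1L, 5L)),
       file.path(demo_dir, "df1.txt"), sep = "\t")
fwrite(data.table(id = 1:3, b1 = c(8L, 5L, 4L),
                  b2 = c(7L, 10L, 11L), b3 = c(6L, 2L, 1L)),
       file.path(demo_dir, "df2.txt"), sep = "\t")

list_file <- list.files(demo_dir, pattern = "\\.txt$", full.names = TRUE)

# One vector of wanted column names per file, in the same order as list_file
# (assumption: file 1 uses prefix "a", file 2 uses prefix "b", and so on)
colnames_list <- lapply(letters[seq_along(list_file)],
                        function(x) paste0(x, 1:2))

# Loop over the indexes so each file is paired with its own column names
list_df <- lapply(seq_along(list_file), function(i) {
  fread(list_file[[i]], select = colnames_list[[i]])
})
names(list_df) <- basename(list_file)
```

Building `colnames_list` from `seq_along(list_file)` keeps the two lists the same length, so every index used in the loop exists in both.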

@TRichard Did you find any solution to your problem? – Paul Endymion Sep 09 '19 at 14:53
Many thanks @Paul Endymion. I have not found a solution to my problem yet. I followed your recent approach involving the `lapply` and `fread` functions. To select the columns I used `lapply(letters[1:2], function(x){ c(paste(x,"1", sep = ""), paste(x,"2", sep = "")) })`. With this I suppose I should get "a1", "a2" and "b1", "b2" for the example I provided. However, I encountered the error "subscript out of bounds", and it is unclear to me why R throws it. Could you please clarify why this error appears? – T Richard Sep 10 '19 at 00:42

@TRichard Well, this error indicates that you are trying to index an object beyond its bounds. Your line works well for me, so I am not sure why it wouldn't for you. Maybe check what `letters[1:2]` returns, or try with the latest version of R... – Paul Endymion Sep 10 '19 at 11:43
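One way this error can arise, offered only as a guess since the thread does not show the failing loop, is that the list of column names is shorter than the list of files. The sketch below reproduces the message under that assumption; the three file names are hypothetical.

```r
# Column names for two files only, built as in the comment above
colnames_list <- lapply(letters[1:2], function(x) paste0(x, 1:2))

# Hypothetical folder listing with three files
list_file <- c("df1.txt", "df2.txt", "df3.txt")

# Asking for a third element of a two-element list raises the error;
# tryCatch captures its message instead of stopping the script
res <- tryCatch(colnames_list[[3]],
                error = function(e) conditionMessage(e))
res

# A cheap check to run before looping over the file indexes
same_length <- length(list_file) == length(colnames_list)
same_length
```

If `same_length` is `FALSE`, a loop over `seq_along(list_file)` will eventually request an element of `colnames_list` that does not exist.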
