0

I am reading in around 50 csv files (same naming convention, same structure, each file approx 150k rows). I then further want to merge all files, but I want to be able to identify the original source for each row.

My solution so far was to read in all data based on list.files and then merge them together with rbindlist and the idcol argument. But I have troubles setting the idcol argument so that it takes the name of the original data.table. In addition I am somehow struggling to define my tables as valid list for rbindlist.

    #get filenames + path
    temp=list.files(path="C:/LocalData",pattern="RV_*",full.names=TRUE)

    #get filenames without path
    temp2=list.files(path="C:/LocalData",pattern="RV_*",full.names=FALSE)

    # get a substring of names to create a new list for the tbl names
    filenames=sapply(temp2,function(x) substr(x,1,5)) 

    #read in all files via fread and store it as an own data.table
    for (i in 1:length(temp)) assign(filenames[i], fread(temp[i])) 

    #now bring all data.tables together and create a new column that indicates the source
    RV=rbindlist(as.list(filenames),idcol = TRUE)
      Error in rbindlist(as.list(filenames), idcol = TRUE) : 
      Item 1 of list input is not a data.frame, data.table or list

    #if I state the dts individually it works
    RV=rbindlist(list(RV_v1,RV_v2,RV_v3,RV_v4,RV_v5),idcol = TRUE)

How can I define a list for rbindlist based on my "filenames" variable?

In addition - instead of having only a numeric value at the newly created.id column I would like to have the values of the original data.table e.g. RV_v1 and RV_v2 how can I achieve that?

> RV[6:15]
    .id Identifier Name       Value
 1:   1          F   AF 68,77523568
 2:   1          G   AG 30,28675331
 3:   2          A   AA 71,38992413
 4:   2          B   AB 86,87556292
 5:   2          C   AC 60,81629287
 6:   2          D   AD 5,815721308
 7:   2          E   AE  11,9030038
 8:   2          F   AF 56,28142304
 9:   2          G   AG 3,291405727
10:   3          A   AA 59,62673465
> 

A similar question was already posed and answered at In R, add NEW column to MULTIPLE df using df names but I was not able to modify it in a way so it would work for me ..

In order to be able to reproduce my problem I uploaded a sample of 5 csv files. https://www.dropbox.com/s/qst2rgjkb0kpori/RVs.zip?dl=0 Thank you in advance!

Edit: as suggested by Frank

rbindlist(lapply(setNames(temp, substr(temp2, 1, 5)), fread), idcol=TRUE)

works fine to do what I wanted to do. Thx!

Edje
  • 1
  • 3

0 Answers0