I would like to read a number of large csv files and bind them but only combine the unique values as there are duplicates that appear across different files.
Previously I used read.csv followed by rbind, but this was very slow.
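Roughly, that earlier code was a loop like this (a minimal sketch, details simplified):

# Earlier approach (sketch): read each file with read.csv and grow the
# result with rbind. Each rbind copies the accumulated data frame, so the
# cost increases with every file read.
files <- list.files('C:/Users/Michael/documents')
df <- NULL
for (f in files) {
  df <- rbind(df, read.csv(f, header = TRUE))
}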
I am now using:
library(plyr)        # rbind.fill
library(data.table)  # fread
files <- list.files('C:/Users/Michael/documents')
df <- rbind.fill(lapply(files, fread, header=TRUE))
Each file contains about 300,000 records, and there are approx. 15 files, bringing the total to about 4m records.
Once complete, I use:
df <- df[!duplicated(df$UniqueID), ]
which reduces the number of records to approx. 3m.
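As a small illustration of what that step does (toy data, made-up values):

# !duplicated() keeps the first row seen for each UniqueID and drops later repeats.
toy <- data.frame(UniqueID = c(1, 2, 2, 3), value = c("a", "b", "b again", "c"))
toy[!duplicated(toy$UniqueID), ]   # keeps the rows with UniqueID 1, 2 (first occurrence), 3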
My questions:
- How can I rbind only the unique values and thus eliminate the last step in the process?
- Is this the most efficient and fastest way to complete this task?
- It seems that I have to set the working directory just prior to running the above, otherwise I get this error (my current workaround is sketched below):
File "file name" does not exist or is non-readable.