
I would like to read a number of large CSV files and bind them, but keep only the unique values, as there are duplicates that appear across different files.

Previously I used read.csv followed by rbind, but this was very slow.

I am now using:

library(plyr)        # for rbind.fill
library(data.table)  # for fread
files <- list.files('C:/Users/Michael/documents')
df <- rbind.fill(lapply(files, fread, header = TRUE))

Each file contains about 300,000 records and there are approx. 15 files, bringing the total to about 4m records.

Once complete, I use:

df <- df[!duplicated(df$UniqueID), ]

which reduces the number of records to approx. 3m.

My questions:

  1. How can I rbind only the unique values and thus eliminate the last step in the process?
  2. Is this the most efficient and fastest way to complete this task?
  3. It seems that I have to set the working directory just prior to running the above; otherwise I get this error:

File "file name" does not exist or is non-readable.

Michael
  • I don't think there is a way to remove duplicates as you read the data in. The only thing I can think of is to dedupe each individual file after reading it and then append it, which would probably require a loop of some sort; once you have a smaller data frame, you can dedupe that. I'm not sure whether this would be more efficient; you'd have to test it using Sys.time(). As for the fastest deduping process, check this out: https://stackoverflow.com/questions/37148567/fastest-way-to-remove-all-duplicates-in-r (see the sketch after these comments for one way this could look). – megmac Jan 04 '23 at 00:02
  • If you set `full.names = TRUE` in your `list.files` command you won't need to set the working directory. – Gregor Thomas Jan 04 '23 at 01:02
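
A minimal sketch of what the commenters suggest, assuming the `data.table` package, that `UniqueID` is the dedupe key from the question, and that the directory and `.csv` pattern below match the actual files (they are illustrative, not confirmed by the asker):

library(data.table)  # fread, rbindlist, unique()

# full.names = TRUE returns complete paths, so no setwd() is needed
files <- list.files('C:/Users/Michael/documents', pattern = '\\.csv$',
                    full.names = TRUE)

# read each file and drop within-file duplicates immediately
read_one <- function(f) unique(fread(f, header = TRUE), by = 'UniqueID')

# bind the already-deduped pieces, then remove duplicates that span files
df <- rbindlist(lapply(files, read_one), use.names = TRUE, fill = TRUE)
df <- unique(df, by = 'UniqueID')

Whether deduping each file before binding is actually faster than a single pass at the end would need benchmarking on the real data, as megmac notes.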

0 Answers