
I have a large dataset of unique file IDs and links to download the files. It looks like this:

file_id <- c("id:fghjs12:ws8c7/syx", "id:f7gnsfu:7a6#*s", "id:dug:shxgcvu:6sh")
link <- c("https://www.dynare.org/wp-repo/dynarewp028.pdf", "https://www.dynare.org/wp-repo/dynarewp029.pdf", "https://www.dynare.org/wp-repo/dynarewp020.pdf")
df <- data.frame(file_id, link, stringsAsFactors = FALSE)

I want to download each file and save it under its file ID. Some of the links are broken, so I wrote the following function to skip failures, but it's not working.

download_documents <- function(url, file_id) {
  tryCatch(
    download.file(url, paste0("~/Desktop/Dataset/files/", file_id)),
    error = function(e) NA,
    warning = function(w) NA
  )
}
Map(download_documents, df$link, df$file_id)

Does anyone know what I'm doing wrong or have a better solution? Thanks in advance for your help!

Oliver
  • Your code is fine; the IDs are the problem, because you cannot save a file whose name contains any of the following characters: \/:*?"<>| Can you create an ID system without these characters? – drJones May 04 '20 at 01:22
  • Ah ok that makes sense. I could convert them all to any other character and keep the uniqueness of the IDs I think. Do you know how to do that? Would I use the gsub function maybe? – Oliver May 04 '20 at 01:24
  • You could do something similar to this: https://stackoverflow.com/questions/33949945/replace-multiple-strings-in-one-gsub-or-chartr-statement-in-r but there is a remote possibility that you will end up with non-unique IDs. Better to make new unique IDs, in my opinion. – drJones May 04 '20 at 01:29
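
The gsub idea from the comments above could be sketched roughly as follows. The replacement rule (anything other than letters, digits, dot, underscore, or hyphen becomes "_") and the name safe_id are assumptions made for illustration, not something stated in the thread. The uniqueness check covers the risk drJones mentions.

```r
# Example IDs from the question
file_id <- c("id:fghjs12:ws8c7/syx", "id:f7gnsfu:7a6#*s", "id:dug:shxgcvu:6sh")

# Replace every character that is not a letter, digit, dot, underscore,
# or hyphen with "_" (this rule is an assumption, adjust as needed)
safe_id <- gsub("[^A-Za-z0-9._-]", "_", file_id)

# Different raw IDs can collapse to the same safe name,
# so stop before anything gets overwritten
if (anyDuplicated(safe_id) > 0) {
  stop("sanitized IDs are no longer unique")
}

safe_id
```

If the check fails, the colliding IDs need a different replacement rule or a suffix before downloading.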

1 Answer

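You can turn the file_id values into valid names using make.names, which replaces each invalid character with a dot. If two IDs could end up with the same name, pass unique = TRUE.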

Map(download_documents, df$link, make.names(df$file_id))
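
As a rough illustration of what make.names does to the example IDs, the sketch below also keeps a lookup table so each saved file can be traced back to its original ID. The names valid_id and id_map are hypothetical, and unique = TRUE is an optional safeguard rather than part of the call above.

```r
# Example IDs from the question
file_id <- c("id:fghjs12:ws8c7/syx", "id:f7gnsfu:7a6#*s", "id:dug:shxgcvu:6sh")

# Invalid characters become "."; unique = TRUE appends a numeric suffix
# if two IDs would otherwise map to the same name
valid_id <- make.names(file_id, unique = TRUE)

# Lookup table from original ID to the name used on disk
id_map <- data.frame(file_id, valid_id, stringsAsFactors = FALSE)
id_map

# A collision case: both inputs map to "a.b", so the second gets a suffix
make.names(c("a:b", "a/b"), unique = TRUE)
```

Note that the resulting names carry no file extension, so the files may need one added if other programs rely on it.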
Ronak Shah
  • Good question. Unfortunately when I use basename(url) to name the documents, there’s heaps of overlap (because it’s cutting everything off before the last ‘/‘) so files get overwritten. – Oliver May 04 '20 at 03:12
  • @Oliver Do you mean the names of the file in `df$link` are not unique? Is there some specific way to name the files that you are looking for or any unique way is ok for you? – Ronak Shah May 04 '20 at 03:17
  • Yes, so because the basename function removes all of the path up to and including the last path separator, I am left with file names like 'dynarewp028.pdf' (using the example I posted). Some of the links are just 'file.pdf' because the last separator is right before the word 'file'. It would be fine if the name was the entire link but I can't figure out a way to do that. In any case, it's better to use the unique IDs I already have because they have a format that makes it easier to understand what they represent (e.g. the institution that published the document, the year, etc.) – Oliver May 04 '20 at 03:19
  • OK, updated the answer to turn `file_id` into valid names. Can you check that? – Ronak Shah May 04 '20 at 03:25
  • This is perfect! Thanks so much!! – Oliver May 04 '20 at 03:39
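
Since some of the links are broken, it may also help to record which downloads failed instead of returning NA. The following is only a sketch: the function name download_document and the dest_dir argument are hypothetical, and treating every warning as a failure is an assumption that may be too strict for some servers.

```r
# Hypothetical helper: returns TRUE on success, FALSE on any error or warning
download_document <- function(url, name, dest_dir) {
  dest <- file.path(dest_dir, name)
  tryCatch({
    # mode = "wb" writes binary files such as PDFs unchanged on Windows
    download.file(url, dest, mode = "wb", quiet = TRUE)
    TRUE
  },
  error = function(e) FALSE,
  warning = function(w) FALSE)
}

# Usage, assuming df from the question and an existing target folder:
# ok <- mapply(download_document, df$link,
#              make.names(df$file_id, unique = TRUE),
#              MoreArgs = list(dest_dir = "~/Desktop/Dataset/files"))
# df$link[!ok]  # links that could not be downloaded
```

Returning a logical per link makes it easy to retry or inspect only the failed rows afterwards.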