
I have a folder of .txt files, each with a long name such as "ctrl_Jack_DrugA_XXuM.txt". However, the name is missing an important piece of information: the timestamp.

That information is available in the data frame inside each file. Each file contains multiple columns, one of which is called "Pid_treatmentsum"; its elements look like "Jack_R4_200514_DrugA_XXuM.txt".

So before I proceed downstream, I want to sort the files into subfolders based on the name (such as Jack) and the timestamp (such as "R4_200514"). To do that, I need to rename each file using its "Pid_treatmentsum" value.

Now the code:

```
#create MRE
#file 1
Row <- c(rep("16", 20))
column <- c(rep("3", 20))
Pid<- c(rep("Jack", 20))
Stimulation<- c(rep("3S", 20))
Drug <- c(rep("2DG", 20))
Dose <-c(rep("3uM", 20))
Treatmentsum <-c(rep(paste("Jack","3S",'2DG','3uM',sep = "_"), 20))
PiD_treatmentsum <- c(rep(paste('Jack',"T4_20200501",'3S','2DG','3uM',sep = "_"), 20))
sampleset <-data.frame(Row,column,Pid,Stimulation,Drug,Dose,Treatmentsum,PiD_treatmentsum)
write.table(sampleset, file = "ctrl_Jack_3S_2DG_3uM.txt",sep="\t", row.names = F, col.names = T)

#file 2
Row <- c(rep("16", 40))
column <- c(rep("3", 40))
Pid<- c(rep("Mark", 40))
Stimulation<- c(rep("3S", 40))
Drug <- c(rep("STS", 40))
Dose <-c(rep("1uM", 40))
Treatmentsum <-c(rep(paste("Mark","3S",'STS','1uM',sep = "_"), 40))
PiD_treatmentsum <- c(rep(paste('Mark',"T5_20200501",'3S','STS','1uM',sep = "_"), 40))
sampleset <-data.frame(Row,column,Pid,Stimulation,Drug,Dose,Treatmentsum,PiD_treatmentsum)
write.table(sampleset, file = "ctrl_Mark_3S_STS_1uM.txt",sep="\t", row.names = F,col.names = T)

# rename all the files using their PiD_treatmentsum 
filenames <- list.files("C:/UsersXXX", pattern="*.txt")
outdirectory <- "~/out"
lapply(filenames, function(x) {
df <- read.csv(x,sep="\t", header=TRUE, fill = T,stringsAsFactors = F)
a <- as.character(unique(df[["PiD_treatmentsum"]]))
b<-paste0("ctrl_",a, '.txt', sep="")
newname <- file.rename(basename(x), b)
write.table(df, paste0(outdirectory,"/", newname, sep="\t", 
          quote=FALSE, row.names=F, col.names=TRUE)
})
```

Here R reports an error: unexpected '}'. I think I must have messed up the loop.

If I dissect the code and run it on a single file as an example, it works:

```
  df <- read.csv('ctrl_Jack_3S_2DG_3uM.txt',sep="\t", header=TRUE, 
             fill = T,stringsAsFactors=F)

  a <- as.character(unique(df[["PiD_treatmentsum"]]))
  b<-paste0("ctrl_",a, '.txt', sep="")
  basename('ctrl_Jack_3S_2DG_3uM.txt')
  file.rename(basename('ctrl_Jack_3S_2DG_3uM.txt'), b)

```

A little help and explanation will be appreciated :)

ML33M
  • `df$Pid_treatmentsum` is a column of the data.frame `df` and not a string. Depending on the content of `df` you can try `newfilename <-df$Pid_treatmentsum[1]` – dario Feb 24 '20 at 20:08
  • Hi @dario, I have tried your suggestion. All elements in that column are identical within each file, so I'm happy to index any one of them. However, file.rename still gives me the same error: invalid 'to' argument – ML33M Feb 24 '20 at 20:12
  • What is the value of `newfilename`? Please edit your question and add the output of `head(df)`, as well as the values of `x`, `newfilename` and `outputdirectory` for a value of `x` that raises the error – dario Feb 24 '20 at 20:15
  • Hi @dario, I have broken up the loop and run the code on a single file from the folder to check what you asked for. The newfilename value is a Factor with 1 level. – ML33M Feb 24 '20 at 20:28
  • @dario edited as above. I hope this helps – ML33M Feb 24 '20 at 20:34
  • 1. Add `stringsAsFactors=FALSE` to the `read.csv` call. 2. `basename(df)` looks suspicious: is `df` the data.frame object? A data.frame wouldn't have a `basename` method, so change this to `basename("ctrl_Jack_DrugA_XXuM.txt")`. 3. When we call `file.rename` we must not forget to add the file extension to the string we got from `df$Pid_treatmentsum[1]`. So a change to `paste0(newfilename, ".txt")` would be appropriate. – dario Feb 24 '20 at 20:39
  • Yes @dario, stringsAsFactors=F fixed newfilename; it is a string now. I found the df problem while editing the question, and indeed df is a data.frame object. I changed it to 'ctrl_Jack_DrugA_XXuM.txt'; however, now when I run it, it returns FALSE instead of renaming – ML33M Feb 24 '20 at 20:46
  • Could you please provide an MRE? It's way easier to troubleshoot that way. My last idea without one is to remove or check `file_path_sans_ext` (no idea why this is called?!) – dario Feb 24 '20 at 20:51
  • @dario, would you mind explaining what an MRE is? Sorry, rookie here. I used file_path_sans_ext to get only the name of the file without its extension – ML33M Feb 24 '20 at 20:55
  • Yeah, that's it then. Remove that part of the code and it will work. [See here](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610#5963610) for information on how to write a minimal reproducible example. Providing an MRE makes it easier for others to help you! – dario Feb 24 '20 at 20:56
  • @dario. The code ran after removing file_path_sans_ext; however, the renamed file lost its extension '.txt'. I tried to use this but it returns FALSE: file.rename(basename('ctrl_Jack_DrugA_XXuM.txt'), paste0(newfilename,'.txt')) – ML33M Feb 24 '20 at 21:15
  • Did you do part 3? "When we call file.rename we must not forget to add the file extension to the string we got from df$Pid_treatmentsum[1]. So a change to paste0(newfilename, ".txt") would be appropriate." – dario Feb 24 '20 at 21:24
  • @dario yes man, after reading the df, I now do newfilename <-paste0(df$Pid_treatmentsum[1], '.txt', sep=""), and I checked that newfilename is the correct string with .txt. Then I ran file.rename(basename('ctrl_S3174542__3S_PDB_100ngml_none.txt'), newfilename). R returns FALSE. I'm confused – ML33M Feb 24 '20 at 21:28
  • I don't think we are able to help you if you don't provide an MRE. No need for the actual files; you can provide some code to create them, something like `write.csv(data.frame(Pid_treatmentsum="Jack_R4_200514_DrugA_XXuM"), file = "ctrl_Jack_DrugA_XXuM.txt", row.names = FALSE)`. But it must be minimal and reproducible. – dario Feb 24 '20 at 21:37
  • @dario okay my friend. Let me try to create some – ML33M Feb 24 '20 at 21:40
  • @dario hi, sorry, maybe my edit isn't very neat, but it works. You should be able to see 2 sample data files and the code to rename them. I think I'm close to getting the code right; it could be the loop/function that I messed up. Also, in case I have a huge folder with massive individual files, would my code be the most efficient? – ML33M Feb 24 '20 at 22:36
  • Good job with the MRE! – dario Feb 24 '20 at 22:44
  • What is the purpose of the final command `write.table(df, paste0(outdirectory,"/", newname, sep="\t", quote=FALSE, row.names=F, col.names=TRUE)`? The variable `newname` is `TRUE`, which does not make sense. To move the renamed file we could use `file.copy`. – dario Feb 24 '20 at 22:45
  • @dario Hi, I was just trying to rename the files and move them into a different folder, so the original untouched files are still there in case I or others need them for a different purpose. In practice, the actual folder contains ~900,000 txt files, each 4-9MB in size, so I also want to make sure the code is fast and efficient. – ML33M Feb 24 '20 at 22:49
  • I posted an answer. I was able to successfully rename and copy the files. – dario Feb 24 '20 at 22:55

1 Answer

This should work:

```
# create MRE
#file 1
Row <- c(rep("16", 20))
column <- c(rep("3", 20))
Pid<- c(rep("Jack", 20))
Stimulation<- c(rep("3S", 20))
Drug <- c(rep("2DG", 20))
Dose <-c(rep("3uM", 20))
Treatmentsum <-c(rep(paste("Jack","3S",'2DG','3uM',sep = "_"), 20))
PiD_treatmentsum <- c(rep(paste('Jack',"T4_20200501",'3S','2DG','3uM',sep = "_"), 20))
sampleset <-data.frame(Row,column,Pid,Stimulation,Drug,Dose,Treatmentsum,PiD_treatmentsum)
write.table(sampleset, file = "ctrl_Jack_3S_2DG_3uM.txt",sep="\t", row.names = F, col.names = T)

#file 2
Row <- c(rep("16", 40))
column <- c(rep("3", 40))
Pid<- c(rep("Mark", 40))
Stimulation<- c(rep("3S", 40))
Drug <- c(rep("STS", 40))
Dose <-c(rep("1uM", 40))
Treatmentsum <-c(rep(paste("Mark","3S",'STS','1uM',sep = "_"), 40))
PiD_treatmentsum <- c(rep(paste('Mark',"T5_20200501",'3S','STS','1uM',sep = "_"), 40))
sampleset <-data.frame(Row,column,Pid,Stimulation,Drug,Dose,Treatmentsum,PiD_treatmentsum)
write.table(sampleset, file = "ctrl_Mark_3S_STS_1uM.txt",sep="\t", row.names = F,col.names = T)
```

I only changed the last three lines. We rename the file using file.rename (newname is now TRUE if the rename succeeded, or FALSE if it failed).

Then we create outdirectory. This raises a warning if the directory already exists, but nothing is overwritten. We could first test whether outdirectory exists and, if so, skip the dir.create call.

Finally we use file.copy to copy the renamed file into outdirectory. We can use file.path to concatenate the directory and filename.

```
# rename all the files using their PiD_treatmentsum
# and copy them to outdirectory
filenames <- list.files(".", pattern = "M\\.txt$")  # names ending in "M.txt"
outdirectory <- "~/out"
lapply(filenames, function(x) {
  df <- read.csv(x, sep = "\t", header = TRUE, fill = TRUE, stringsAsFactors = FALSE)
  a <- as.character(unique(df[["PiD_treatmentsum"]]))
  # paste0 has no sep argument, so none is passed
  b <- paste0("ctrl_", a, ".txt")
  # newname is TRUE if the rename succeeded, FALSE otherwise
  newname <- file.rename(basename(x), b)
  # warns (without overwriting) if outdirectory already exists
  dir.create(outdirectory)
  file.copy(b, file.path(outdirectory, b))
})
```
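The existence check mentioned above could look roughly like the sketch below. It also copies each file straight to its new name instead of renaming it first, so the originals stay untouched, and it reads only the first data row, which may help with a very large folder. `copy_with_new_name` is a hypothetical helper name, and the sketch assumes that every row of `PiD_treatmentsum` within a file holds the same value.

```r
# Hypothetical helper: copy one file into out_dir under the name stored in
# its PiD_treatmentsum column, leaving the original file in place.
copy_with_new_name <- function(path, out_dir) {
  # Create the output directory only when it does not exist yet
  if (!dir.exists(out_dir)) {
    dir.create(out_dir, recursive = TRUE)
  }
  # Assumption: all rows share one PiD_treatmentsum value, so one row is enough
  first_row <- read.delim(path, nrows = 1, stringsAsFactors = FALSE)
  new_name <- paste0("ctrl_", first_row[["PiD_treatmentsum"]][1], ".txt")
  # file.copy returns TRUE on success and FALSE otherwise
  file.copy(path, file.path(out_dir, new_name))
}

# Small demo in a fresh temporary directory
demo_dir <- tempfile("rename_demo")
dir.create(demo_dir)
demo_file <- file.path(demo_dir, "ctrl_Jack_3S_2DG_3uM.txt")
write.table(
  data.frame(Pid = "Jack", PiD_treatmentsum = "Jack_T4_20200501_3S_2DG_3uM"),
  file = demo_file, sep = "\t", row.names = FALSE
)
out_dir <- file.path(demo_dir, "out")
copied <- copy_with_new_name(demo_file, out_dir)
```

For a whole folder, something like `lapply(list.files(".", pattern = "\\.txt$"), copy_with_new_name, out_dir = "~/out")` would apply the helper to each file.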

I'd suggest updating the variable names to something more meaningful to make future refactoring easier ;)
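The question also mentions sorting the files into subfolders by name and timestamp. A minimal sketch of deriving such a subfolder from the identifier follows. `subfolder_for` is a hypothetical helper, and it assumes the first three "_"-separated fields of `PiD_treatmentsum` are the name, the run label and the date, as in the MRE.

```r
# Hypothetical helper: build "<name>/<run>_<date>" from an identifier such as
# "Jack_T4_20200501_3S_2DG_3uM". Assumes the first three "_"-separated
# fields are name, run label and date.
subfolder_for <- function(id) {
  parts <- strsplit(id, "_", fixed = TRUE)[[1]]
  stopifnot(length(parts) >= 3)
  file.path(parts[1], paste(parts[2], parts[3], sep = "_"))
}

subfolder_jack <- subfolder_for("Jack_T4_20200501_3S_2DG_3uM")
subfolder_mark <- subfolder_for("Mark_T5_20200501_3S_STS_1uM")
print(subfolder_jack)
print(subfolder_mark)
```

The result can be appended to the output directory with `file.path(outdirectory, subfolder_for(a))` and created with `dir.create(..., recursive = TRUE)` before copying.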

dario
  • Thank you, it's running now, I selected 600 files and let it run, I will see the results tmr morning :) – ML33M Feb 24 '20 at 23:30
  • Best of luck!! ;) – dario Feb 25 '20 at 06:21
  • Excellent! Glad to hear that! – dario Feb 25 '20 at 15:09
  • Hi @dario, there is an error. The code runs okay in the beginning and files are created. But then it reports an error after processing 33 files (out of 600), saying "Error in file.rename(basename(x), b) : 'from' and 'to' are of different lengths" – ML33M Feb 25 '20 at 15:11
  • So this error occurred while processing `filenames[34]`? What is its value? And what are the values of `df`, `as.character(unique(df[["PiD_treatmentsum"]]))` and `outdirectory`? – dario Feb 25 '20 at 15:16
  • I'm not entirely sure it is file 34; I guessed that because there are only 33 files in the output. How do you check the value of the as.characterXXXX? It is inside that function, and R just executed the function without showing any value – ML33M Feb 25 '20 at 15:25
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/208511/discussion-between-ml33m-and-dario). – ML33M Feb 25 '20 at 15:30
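
One possible cause of the "'from' and 'to' are of different lengths" error, not confirmed in this thread, is a file whose `PiD_treatmentsum` column holds more than one distinct value (or none), so that `b` is no longer a single string. A rough sketch for finding such files is shown below; `count_ids` is a hypothetical helper name.

```r
# Hypothetical helper: count the distinct PiD_treatmentsum values per file.
# A count other than 1 would make file.rename() receive a 'to' vector whose
# length differs from 'from'.
count_ids <- function(paths) {
  vapply(paths, function(p) {
    df <- read.delim(p, stringsAsFactors = FALSE)
    length(unique(df[["PiD_treatmentsum"]]))
  }, integer(1))
}

# Demo with one consistent file and one file holding two identifiers
demo_dir <- tempfile("id_check")
dir.create(demo_dir)
good <- file.path(demo_dir, "good.txt")
bad <- file.path(demo_dir, "bad.txt")
write.table(data.frame(PiD_treatmentsum = rep("Jack_T4", 3)),
            file = good, sep = "\t", row.names = FALSE)
write.table(data.frame(PiD_treatmentsum = c("Mark_T5", "Mark_T6")),
            file = bad, sep = "\t", row.names = FALSE)

n_ids <- count_ids(c(good, bad))
problem_files <- names(n_ids)[n_ids != 1]
print(basename(problem_files))
```

Running `count_ids(filenames)` on the real folder before renaming would list the files that need a closer look.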