0

I am currently working on thousands of files (with .MOD extension) where I want to extract specific information from all these files. These information will then be collected into one excel sheet in such a way that each row represents information extracted from one .MOD file. I have managed to do this.

However, there are lets say about 10-20 files (out of the tens of thousands) that do not contain information in the format that I want, and this therefore throws an error. I cannot of course manually keep digging into all the files, or cannot subset them each time to find which of these files is throwing the error. Therefore, I want to include a tryCatch() function, so that the script still continues to run without stopping. For the files that give error, I simply want the values to be replaced by "Error" in those specific cells. Can anyone help me how to do that?

Following is how I want my final excel output to look like:

ID  COL1    COL2    COL3    COL4    COL5    COL6    COL7    COL8
Sample1 9-5-2014    10:42:41    600 1207    3   2   62  30
Sample2 8-1-2013    08:44:50    654 1873    1   7   60  45
Sample3 2-3-2013    14:47:40    767 1645    1   18  66  37
Sample4 8-2-2013    08:50:45    727 1500    1   8   68  45
Sample5 4-1-2013    13:08:49    Error   Error   Error   Error   Error   Error
Sample6 1-2-2013    13:08:47    720 1433    1   16  60  51
Sample7 3-4-2013    13:59:04    610 1343    2   13  66  32

Following is my code (along with the error):

AR.MOD.files <- list.files(pattern = "AR.MOD|ar.MOD")
    for (fileName in AR.MOD.files) {
    AR.MOD <- read.table(fileName, header = FALSE, fill = TRUE)
    AR.MOD.subset1 <- AR.MOD[c(1), 3:4]
    names(AR.MOD.subset1) <- c("COL1", "COL2")
    AR.MOD.subset2 <- AR.MOD[c(3), 3:8]
    names(AR.MOD.subset2) <- c("COL3", "COL4", "COL5", "COL6", "COL7", "COL8")
    AR.MOD.final <- merge(AR.MOD.subset1, AR.MOD.subset2)
    ID <- basename(fileName)
    AR.MOD.final <- merge (ID, AR.MOD.final)
    colnames(AR.MOD.final)[colnames(AR.MOD.final)=="x"] <- "ID"
    if(match(fileName,AR.MOD.files)==1){
            output.AR.MOD <- AR.MOD.final
        }else{
            output.AR.MOD <- rbind(output.AR.MOD,AR.MOD.final)}
        }
Error in `[.data.frame`(AR.MOD, c(3), 3:8) : undefined columns selected
    output.AR.MOD$ID <- gsub("AR.MOD", "", paste(output.AR.MOD$ID))
    output.AR.MOD$ID <- gsub("ar.MOD", "", paste(output.AR.MOD$ID))
    print(output.AR.MOD)

I here share 2 example files:

> AR.MOD <- read.table("Sample1ar.MOD", header = FALSE, fill = TRUE)
> AR.MOD
    V1 V2        V3       V4   V5    V6   V7    V8
1 Case  1 23-3-2013 14:47:40                      
2  Run NA                                         
3    R  1    767,96  1647,72 1,78 18,88 0,66 37,33

> AR.MOD <- read.table("Sample2AR.MOD", header = FALSE, fill = TRUE)
> AR.MOD
    V1 V2       V3       V4   V5   V6   V7    V8
1 Case  1 9-5-2014 10:42:41                     
2  Run NA                                       
3    R  1   566,47  1207,22 3,05 2,95 0,62 30,00

It works with the above 2 examples. However, if one of the column is missing, lets say in the following, then it throws error.

> AR.MOD <- read.table("Sample3AR.MOD", header = FALSE, fill = TRUE)
> AR.MOD
    V1 V2        V3      V4   V5   V6   V7
1 Case  1 28-1-2013 8:44:50                     
2  Run NA                                       
3    R  1    783,76 1873,70 1,34 7,48 0,60

I am at this point not sure which file it is coming from, but I here send you a dummy example in the 3rd sample from above. I am not able to attach files directly here, that is why I read it and send you as an output.

Letin
  • 1,255
  • 5
  • 20
  • 36
  • Where exactly does your error occur? It isn't clear to me at which point the process is producing an error. However ... *strongly* urge you to not use `rbind` iteratively, its performance will scale very poorly, and if you have 10k files, you will know very quickly (and wait a long time). Suggest wrapping this in `lapply` and then do *once*: `do.call(rbind.data.frame, list_of_frames)`. This might help resolve some of your errant-file problems, too. – r2evans Sep 06 '19 at 15:12
  • Thanks for letting me know about rbind. I will work on that. Regarding the error, it shows exactly after the step where I showed in the code. But it probably comes from here "AR.MOD.subset2 <- AR.MOD[c(3), 3:8]" because in those files with errors, there are no columns from 3:8. Probably 1 or 2 columns in those files are missing. – Letin Sep 06 '19 at 15:26

2 Answers2

1

I'd echo the lapply solution to make the tables in individual list elements and then handle the combination afterwards. Here is an example using the data.table package that fills the data with NA's where it can't find it:

# # for installing:
# install.packages(data.table)
library(data.table)

# generate tables with uneven columns
set.seed(1)
tables <- lapply(1:10, function(i){
  ncols <- sample(1:5, 1, 1)
  out <- as.data.frame(matrix(runif(ncols), nrow=1, ncol=ncols))
})

# you can use rbindlist with fill=TRUE to fill the bad values with NA
output <- as.data.frame(rbindlist(tables, fill=TRUE))

EDIT: I can't be certain this will work off the bat, but give it a try:

# # for installing:
# install.packages(data.table)
library(data.table)

# Set this to what you expect max to be 
ncol_total <- 9
tables <- lapply(AR.MOD.files, function(fileName){
  AR.MOD <- read.table(fileName, header = FALSE, fill = TRUE)
  AR.MOD.subset1 <- AR.MOD[c(1), 3:4]
  names(AR.MOD.subset1) <- c("COL1", "COL2")
  AR.MOD.subset2 <- AR.MOD[c(3), 3:8]
  names(AR.MOD.subset2) <- c("COL3", "COL4", "COL5", "COL6", "COL7", "COL8")
  AR.MOD.final <- merge(AR.MOD.subset1, AR.MOD.subset2)
  ID <- basename(fileName)
  AR.MOD.final <- merge (ID, AR.MOD.final)
  colnames(AR.MOD.final)[colnames(AR.MOD.final)=="x"] <- "ID"

  # add in missing data
  ncol_file <- ncol(AR.MOD.final)
  missing <- ncol_total - ncol_file
  if(missing > 0){
    new_data <- as.data.frame(matrix("Error", nrow=nrow(AR.MOD.final), ncol=missing))
    AR.MOD.final <- cbind(AR.MOD.final, AR.MOD.final)
  }

  AR.MOD.final
})

# this will likely screw up the column names. Its better to know what these
# are and assign after, as long as the tables are all in the same order
output <- as.data.frame(rbindlist(tables, use.names = FALSE))
names(output) <- c("ID", "COL1", "COL2", "COL3", "COL4", "COL5", "COL6", "COL7"
                   "COL8")

# continuing on
output$ID <- gsub("AR.MOD", "", paste(output$ID))
output$ID <- gsub("ar.MOD", "", paste(output$ID))
print(output)
Jonny Phelps
  • 2,687
  • 1
  • 11
  • 20
  • If the user doesn't need `data.table` for other parts, then `rbindlist` can be replaced (with minor nuances) with `do.call(rbind.data.frame, tables)`. – r2evans Sep 06 '19 at 15:30
  • That errors for me due to the columns not matching. Does it have an argument similar to `fill` in `rbindlist`? – Jonny Phelps Sep 06 '19 at 15:31
  • https://stackoverflow.com/questions/3402371/combine-two-data-frames-by-rows-rbind-when-they-have-different-sets-of-columns for other alternatives. Edit: its a bit old now, its `dplyr` now not `plyr` for top solution – Jonny Phelps Sep 06 '19 at 15:32
  • @Jonny: Will you be able to implement your code with my full code just so I can try it as it is, rather than making mistakes? Also, is it possible to add something else other than NA (may be "Error")? I do have NAs in the excel file due to other reasons. Thats why I want to separate these with other names in order to know that is due to the error. – Letin Sep 06 '19 at 15:34
  • yeah sure, edited my answer. Ah wait, just seen what you mean, ignore my edit – Jonny Phelps Sep 06 '19 at 15:46
  • Ok might work now. I've moved towards adding in the data to each lapply if we can test its missing by checking the number of columns. – Jonny Phelps Sep 06 '19 at 15:57
  • Thanks so much for editing. I just tried the part until AR.MOD.final }) It is still giving me the same error. – Letin Sep 06 '19 at 16:12
  • Jonny: if you need `fill=` then that sounds like your files are not structured identically. Letin, can you provide two file samples (with a few rows each)? Just something representative for testing. – r2evans Sep 06 '19 at 16:15
  • @Jonny and r2evans: Really thank you for your efforts. I have edited my question for you to see the files that I have. – Letin Sep 06 '19 at 16:34
1

Here's an approach.

writeLines("a,b\n1,2", "Letin_good.csv")
writeLines("", "Letin_bad1.csv")
writeLines("c,d\n3,4", "Letin_bad2.csv")

# myfiles <- list.files(pattern = "Letin.*\\.csv", full.names = TRUE)
myfiles <- c("Letin_good.csv", "Letin_good.csv", "Letin_bad1.csv", "Letin_good.csv", "Letin_bad2.csv")

datlist <- lapply(myfiles, function(fn) {
  tryCatch({
    out <- read.csv(fn, header=TRUE, stringsAsFactors=FALSE)
    # do something with the data
    out
  },
    error = function(e) NULL)
})
str(datlist)
# List of 5
#  $ :'data.frame': 1 obs. of  2 variables:
#   ..$ a: int 1
#   ..$ b: int 2
#  $ :'data.frame': 1 obs. of  2 variables:
#   ..$ a: int 1
#   ..$ b: int 2
#  $ : NULL
#  $ :'data.frame': 1 obs. of  2 variables:
#   ..$ a: int 1
#   ..$ b: int 2
#  $ :'data.frame': 1 obs. of  2 variables:
#   ..$ c: int 3
#   ..$ d: int 4

At this point, the third element is clearly wrong (read.csv failed) and the fifth element is incorrect (wrong headers). We can generate a filter of sorts that returns TRUE if all "conditions" are met (e.g., all required names present):

gooddatlist <- Filter(function(x) {
  all(
    c("a", "b") %in% names(x)
    # other tests
  )
}, datlist)
str(gooddatlist)
# List of 3
#  $ :'data.frame': 1 obs. of  2 variables:
#   ..$ a: num 12
#   ..$ b: int 2
#  $ :'data.frame': 1 obs. of  2 variables:
#   ..$ a: num 12
#   ..$ b: int 2
#  $ :'data.frame': 1 obs. of  2 variables:
#   ..$ a: num 12
#   ..$ b: int 2

alldat <- do.call(rbind, gooddatlist)
r2evans
  • 141,215
  • 6
  • 77
  • 149