0

I need to perform a series of calculations from several files and create a single output table with the results from all files.

I have tried by listing all files within folder and looping, either with for or with lapply functions but there is something I am missing.

Here is a simplified vs of what I have with some 'fake' files:

# Create new folder -- "trials"
setwd("C:/Users/.../Desktop")
dir.create("trials")

# Create 'trial' files
setwd("C:/Users/.../Desktop/trials")
pathFiles <- "C:/Users/.../Desktop/trials"

df_1 <- data.frame(x=c(1,2,3,4,5,6,7,8,9,10))
df_1$y <- c(1,2,3,4,5,6,7,8,9,10)
df_1$z <- c(10,20,30,40,50,60,70,80,90,100)
write.table(df_1, "table1.csv", col.names = TRUE, row.names = FALSE, sep = ",")

df_2 <- data.frame(x=c(2,3,4,5,6,7,8,9,10,11))
df_2$y <- c(2,3,4,5,6,7,8,9,10,11)
df_2$z <- c(20,30,40,50,60,70,80,90,100,110)
write.table(df_2, "table2.csv", col.names = TRUE, row.names = FALSE, sep = ",")

df_3 <- data.frame(x=c(3,4,5,6,7,8,9,10,11,12))
df_3$y <- c(3,4,5,6,7,8,9,10,11,12)
df_3$z <- c(30,40,50,60,70,80,90,100,110,120)
write.table(df_3, "table3.csv", col.names = TRUE, row.names = FALSE, sep = ",")

For each of these files, I want to extract certain information and create an output table with all calculated fields.

I've tried with for loop:

Final <- NULL
M <- NULL
slp <- NULL
eval <- NULL

dfs <- dir(pathFiles, "*.csv", full.names = TRUE, ignore.case = TRUE, all.files = TRUE)

for (df in dfs) {

  t <- read.csv(df, header = TRUE, sep = ",")
  x <- t$x
  y <- t$y
  z <- t$z

  lim_y <- y >= 3 & y <=6
  lim_x <- x[lim_y]
  lim_z <- z[lim_y]

  iFinal <- x[nrow(t)]
  Final <- c(Final, iFinal) # add value to the string

  iM <- mean(lim_z)
  M <- c(M, iM) # add value to the string

  p <- lm(lim_x ~ lim_z)
  iSlp <- summary(p)$coefficients[2,1]
  slp <- c(slp, iSlp) # add value to the string

  ifelse ((Slp <= 0.05 & Slp >= -0.05), ieval <- "ok", ieval <- "false") 
  eval <- c(eval, ieval) # add value to the string
}

sum_df <- data.frame(df, M, Slp, eval, Final)
write.table(sum_df, "sum_df.csv", sep = ",", row.names = FALSE, col.names = TRUE)

I have used this for loop in a similar way before and it worked fine, but not here.

With an lapply function I do not get better results:

dfs <- list.files(pathFiles, "^.+\\.csv", full.names = TRUE, ignore.case = TRUE, all.files = TRUE)

Final <- NULL
M <- NULL
slp <- NULL
eval <- NULL

model <- function(x){
  t <- read.csv(x, header = TRUE, sep = ",")
  x <- t$x
  y <- t$y
  z <- t$z

  lim_y <- y >= 3 & y <=6
  lim_x <- x[lim_y]
  lim_z <- z[lim_y]

  iFinal <- x[nrow(t)]
  Final <- c(Final, iFinal)

  iM <- mean(lim_z)
  M <- c(M, iM)

  p <- lm(lim_x ~ lim_z)
  iSlp <- summary(p)$coefficients[2,1]
  slp <- c(slp, iSlp)

  ifelse ((Slp <= 0.05 & Slp >= -0.05), ieval <- "ok", ieval <- "false") 
  eval <- c(eval, ieval)
}

lapply(dfs, model)

The functions and output table works fine with just one file, so I guess the error must be in how I am looping through the files. But I don't know where I'm going wrong.

I would appreciate any help.

ebb
  • 274
  • 2
  • 6
  • 16
  • 1
    I've just run your for loop, and the only issue I see is a case issue with `slp` and `Slp`. After adjusting that on my end, the loop completed successfully. Otherwise, check to ensure that how you have `pathFiles` defined is a legitimate directory, and that your files actually do exist in that directory. – 93i7hdjb Apr 05 '18 at 18:20
  • I changed that as you suggested but it is not working for me. I get the following error message: `Error in data.frame(df, M, Slp, eval, Final) : arguments imply differing number of rows: 1, 0`. All files are in the pointed directory, so not sure what I am doing wrong. – ebb Apr 05 '18 at 19:32
  • Good completely-reproducible question, by the way. It made it easier to give it a quick try. – r2evans Apr 05 '18 at 22:26

2 Answers2

1

I would recommend rbindlist in the data.table library.

lapply will return a list of length files, rbindlist this list together into a single table

library(data.table)
files <- dir(pathFiles, "*.csv", full.names = TRUE, ignore.case = TRUE, all.files = TRUE)
desiredTable <- rbindlist(
                          lapply(
                                 files,
                                 function(x){
                                 fileData <- fread(x)
                                 CalculatedData <- ...do stuff...
                                 return(CalculatedData)
                                 }
                                )
                            )

Here is a working example using do.call, avoiding the use of data.table

numFiles <- 100 #number of random files to generate

# Generate a bunch of .csv with a fileID, some letters, and some numbers and put those files in the working dir
sapply(
  1:numFiles,
  function(f){
    dataReplicates <- 12
    dataLetters <- sample(LETTERS,12)
    dataNumbers <- sample(seq(1:100),12)
    fileID <- rep(f,dataReplicates)
    fileData <- cbind(
      fileID,
      dataLetters,
      dataNumbers
    )
    write.csv(
      fileData,
      paste0(getwd(),"/",Sys.Date(),"_",f,".csv"),
      row.names = FALSE
    )
   }
  )

# Read those files back in and store the names in a vector
thoseRandFiles <- dir(
  path = getwd(),
  pattern = as.character(Sys.Date()),
  full.names = TRUE
)

#using lapply and rbind, read in each file, perform operations, and bind into a single table
desiredTable <- do.call(
      rbind,
       lapply(
       thoseRandFiles,
        function(x){
         fileData <- read.csv(x)
         fileID <- fileData$fileID[1]
         firstLetter <- as.character(fileData$dataLetters[1])
         sumNumbers <- sum(fileData$dataNumbers)
         calData <- cbind.data.frame(fileID,firstLetter,sumNumbers)
         return(calData)
         }
       )
      )
SubstantiaN
  • 373
  • 4
  • 14
  • 2
    Lacking the want/need for a `DT` object, perhaps just `do.call(rbind, lapply(...))`. – r2evans Apr 05 '18 at 18:44
  • 1
    @r2evans you're right. I tend to operate with `DT` almost by default. Your recommendation would accomplish the same in base – SubstantiaN Apr 05 '18 at 18:54
  • @SubstantiaN and @r2evans thank you so much. Still missing something though. Maybe is the way I am entering the queries. It says "args" is missing. Is the first time I use `function(x)` and probably not structuring it well. I'll have to play with it a little. – ebb Apr 05 '18 at 19:44
  • @r2evans, I hate to ask for this but would you mind posting the code as you would to run the script? I keep getting the `argument "args" is missing, with no default` error message, and it feels I am going in circles. I know it's basic, so I'd understand if you don't. Thanks in advance. – ebb Apr 05 '18 at 20:45
  • @ SubstantiaN thank you very much. Nice approach! Following your suggestions, I did something quite similar but defining the `function` separately. However, the table I was getting showed the last file three times instead of one set of results for each file. I guess it had to do with how I was applying the `combine` function. Thanks again; this was so helpful. – ebb Apr 05 '18 at 22:29
1

You reference slp and Slp, so there's a typo somewhere. Doing a global replacement for one of them fixes that bug.

Your for loop does not produce an error for me.

Your lapply is wrong on a few notes:

  • in general, one of the benefits of using *apply functions is that they work without side-effect, which is what you are doing within your for loop, and what you are setting up for when you globally assign eval and friends. Without trying to reach "out" from within lapply and assigning to the variables in the global namespace, your assignments to M and friends are silently discarded when the function exits. When you think of using these apply functions (and they are great), you should almost always assume that their universe completely disappears when the function exits, and they cannot exit. If you're a Trekkie, think of Remember Me (Star Trek TNG), where Beverly's universe is only what is within the bubble. (It is possible to pierce it, both in R and in the tv show.)

  • Your function returns only eval, and that is only by accident. If you want to return all of the things you've highlighted as "interesting", then you need to explicitly return them, perhaps as a list or data.frame. (Not vector, as ieval will up-convert all vars to character.)

So don't think about concatenating data within lapply, think of keeping the results well-structured and combining later. Try this:

model2 <- function(fname) {
  dat <- read.csv(fname, header = TRUE, sep = ",")
  lim_y <- dat$y >= 3 & dat$y <=6
  lim_x <- dat$x[lim_y]
  lim_z <- dat$z[lim_y]

  iFinal <- dat$x[nrow(dat)]
  iM <- mean(lim_z)

  p <- lm(lim_x ~ lim_z)
  iSlp <- summary(p)$coefficients[2,1]

  iEval <- (iSlp <= 0.05 & iSlp >= -0.05) 

  return(data.frame(
    fname = fname,
    M = iM, Slp = iSlp, Eval = iEval, Final = iFinal,
    stringsAsFactors = FALSE))
}

do.call(rbind, lapply(dfs, model2))
# Warning in summary.lm(p) :
#   essentially perfect fit: summary may be unreliable
# Warning in summary.lm(p) :
#   essentially perfect fit: summary may be unreliable
# Warning in summary.lm(p) :
#   essentially perfect fit: summary may be unreliable
#          fname  M Slp  Eval Final
# 1 ./table1.csv 45 0.1 FALSE    10
# 2 ./table2.csv 45 0.1 FALSE    11
# 3 ./table3.csv 45 0.1 FALSE    12

There are countless ways to do this with the *apply family, but I think this is a decent one.

For some decent reading about *apply, frames within lists, and such, see:

r2evans
  • 141,215
  • 6
  • 77
  • 149
  • great! thanks!! for some reason `for` loop is not working for me, but this works great! I did something similar but the `combine` bit was driving me crazy. I'll keep on trying with the `for` loop anyway; now it's just bugging me! Thanks again! – ebb Apr 05 '18 at 22:22
  • BTW the Trekkie tip was quite enlightening, and I am not kidding. I learned a lot, thanks. – ebb Apr 05 '18 at 22:37