How to calculate the mean of one column across multiple .csv files?

Question

I am a newbie in R and have to calculate the mean of the sulfate column across 332 files. The mean formula below works well with one file. The problem comes when I attempt to calculate it across all the files.

Perhaps reading all the files and storing them in mydata does not work properly? Could you help me out?

Many thanks

pollutantmean <- function(specdata, pollutant = xor(sulf, nit), i = 1:332) {
  specdata <- getwd()
  pollutant <- c(sulf, nit)

  for (i in 1:332) {
    mydata <- read.csv(file_list[i])
  }

  sulfate <- (subset(mydata, select = c("sulfate")))
  sulf <- sulfate[!is.na(sulfate)]
  y <- mean(sulf)

  print(y)
}

What exactly are you trying to do? Calculate the mean for each file separately and store it in a vector, calculate the mean for _all_ files...? – bouncyball, Jan 29 '17 at 19:26
@bouncyball I believe the latter is true. The tools to use here are `list.files` and `lapply`/`sapply`. – Roman Luštrik, Jan 29 '17 at 21:06

score 0 · Answer 1 · edited May 23 '17 at 11:53


This is not tested, but the steps are as follows. Note also that this kind of question is asked over and over again (e.g. here). Try searching for "work on multiple files", "batch processing", "import many files" or something akin to this.

# full paths of all .csv files in the working directory
lx <- list.files(pattern = "\\.csv$", full.names = TRUE)

# gives you a list of one-column data.frames, one per file
xy <- sapply(lx, FUN = function(x) {
  out <- read.csv(x)
  out <- out[, "sulfate", drop = FALSE] # do not drop to vector just for fun
  out <- out[!is.na(out[, "sulfate"]), , drop = FALSE] # keep only rows where sulfate is not missing
  out
}, simplify = FALSE)

xy <- do.call(rbind, xy) # combine the result for all files into one big data.frame
mean(xy[, "sulfate"]) # calculate the mean
# or
summary(xy)

If you are short on RAM, this can be optimized a bit.
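
One way to do that, as a minimal sketch: keep only a running sum and a count instead of binding every file into one data.frame. The demo directory and the three small files are made up for illustration; `total`, `n` and `running_mean` are hypothetical names.

```r
# Hypothetical demo data: three small CSV files in a temporary directory.
demo_dir <- file.path(tempdir(), "running_mean_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(sulfate = c(1, 2, NA)), file.path(demo_dir, "001.csv"), row.names = FALSE)
write.csv(data.frame(sulfate = c(3, NA)), file.path(demo_dir, "002.csv"), row.names = FALSE)
write.csv(data.frame(sulfate = c(4, 5)), file.path(demo_dir, "003.csv"), row.names = FALSE)

files <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)

total <- 0 # running sum of non-missing sulfate values
n <- 0     # running count of non-missing sulfate values
for (f in files) {
  values <- read.csv(f)[["sulfate"]]
  values <- values[!is.na(values)]
  total <- total + sum(values)
  n <- n + length(values)
}

# Same result as the mean over all files combined, but only one file
# is held in memory at a time.
running_mean <- total / n
print(running_mean)
```

Only the two counters survive each iteration, so memory use stays at roughly the size of the largest single file.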


answered Jan 29 '17 at 21:11

Roman Luštrik


Why `sapply` with *simplify = FALSE*? Why not just call `lapply`? – Parfait Jan 29 '17 at 21:16
@Parfait personal choice, no reason. – Roman Luštrik Jan 29 '17 at 21:22
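
For what it's worth, the two calls differ only in how the result is named: with a character vector as input, `sapply(..., simplify = FALSE)` names the list elements after the input, while `lapply` leaves them unnamed. A small sketch with made-up input (`x`, `via_sapply` and `via_lapply` are illustrative names):

```r
x <- c("a", "b")

via_sapply <- sapply(x, toupper, simplify = FALSE) # list named after the input
via_lapply <- lapply(x, toupper)                   # unnamed list

print(names(via_sapply))
print(names(via_lapply))

# Apart from the names, the contents are the same.
same_contents <- identical(unname(via_sapply), via_lapply)
print(same_contents)
```

With file paths as input, those names end up in the row names after `do.call(rbind, ...)`, which can be handy for tracing a row back to its file.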

score 0 · Answer 2 · answered Feb 04 '17 at 15:27

Thank you for your help.

I have sorted it out. The key was to use full.names=TRUE in list.files and rbind(mydata, ... ); otherwise the loop reads the files one by one and does not append them after each other, which is my aim.

See below. I am not sure if it is the most "R" solution, but it works.

pollutantmean <- function(directory, pollutant, id = 1:332) {
  files_list <- list.files(directory, full.names = TRUE)
  mydata <- data.frame()
  for (i in id) {
    mydata <- rbind(mydata, read.csv(files_list[i]))
  }

  if (pollutant %in% "sulfate") {
    mean(mydata$sulfate, na.rm = TRUE)
  } else if (pollutant %in% "nitrate") {
    mean(mydata$nitrate, na.rm = TRUE)
  } else {
    "wrong pollutant"
  }
}
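
A more compact variant of the same idea, as a sketch only: read the selected files with `lapply`, bind them once, and pick the column by name instead of branching on it. The function name `pollutant_mean` and the demo files are made up for illustration, and the sketch assumes every file has both a sulfate and a nitrate column.

```r
# Hypothetical compact variant: the column is selected by name.
pollutant_mean <- function(directory, pollutant, id = 1:332) {
  stopifnot(pollutant %in% c("sulfate", "nitrate"))
  files <- list.files(directory, pattern = "\\.csv$", full.names = TRUE)
  tables <- lapply(files[id], read.csv) # one data.frame per selected file
  combined <- do.call(rbind, tables)    # bind once instead of growing in a loop
  mean(combined[[pollutant]], na.rm = TRUE)
}

# Hypothetical demo data: three small files in a temporary directory.
demo_dir <- file.path(tempdir(), "pollutant_mean_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(sulfate = c(1, NA), nitrate = c(10, 20)),
          file.path(demo_dir, "001.csv"), row.names = FALSE)
write.csv(data.frame(sulfate = c(3, 5), nitrate = c(NA, 30)),
          file.path(demo_dir, "002.csv"), row.names = FALSE)
write.csv(data.frame(sulfate = 7, nitrate = 40),
          file.path(demo_dir, "003.csv"), row.names = FALSE)

sulfate_all <- pollutant_mean(demo_dir, "sulfate", id = 1:3)
nitrate_first_two <- pollutant_mean(demo_dir, "nitrate", id = 1:2)
print(sulfate_all)
print(nitrate_first_two)
```

Binding once with `do.call(rbind, ...)` avoids copying the growing data.frame on every iteration, which matters more as the number of files goes up.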
