
I am a newbie in R and have to calculate the mean of the column sulfate across 332 files. The mean calculation below works well with one file. The problem comes when I attempt to calculate it across all the files.

Perhaps reading all the files and storing them in mydata does not work as I expect? Could you help me out?

Many thanks

pollutantmean <- function(specdata, pollutant = xor(sulf, nit), i = 1:332) {
  specdata <- getwd()
  pollutant <- c(sulf, nit)

  for (i in 1:332) {
    mydata <- read.csv(file_list[i])
  }

  sulfate <- (subset(mydata, select = c("sulfate")))
  sulf <- sulfate[!is.na(sulfate)]
  y <- mean(sulf)

  print(y)
}
Susan
  • What exactly are you trying to do? Calculate the mean for each file separately and store it in a vector, calculate the mean for _all_ files...? – bouncyball Jan 29 '17 at 19:26
  • @bouncyball I believe the latter is true. The tools to use here are `list.files` and `lapply`/`sapply`. – Roman Luštrik Jan 29 '17 at 21:06

2 Answers


This is not tested, but the steps are as follows. Note also that this kind of question gets asked over and over again (e.g. here). Try searching for "work on multiple files", "batch processing", "import many files" or something akin to this.

# full paths of all CSV files in the working directory
lx <- list.files(pattern = "\\.csv$", full.names = TRUE)

# gives you a list of data.frames, one per file
xy <- sapply(lx, FUN = function(x) {
  out <- read.csv(x)
  out <- out[, "sulfate", drop = FALSE] # do not drop to vector just for fun
  out <- out[!is.na(out[, "sulfate"]), , drop = FALSE] # keep only non-missing rows
  out
  }, simplify = FALSE)

xy <- do.call(rbind, xy) # combine the result for all files into one big data.frame
mean(xy[, "sulfate"]) # calculate the mean
# or
summary(xy)

If you are short on RAM, this can be optimized a bit.
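
One way to do that, as a rough sketch: keep only a running sum and a count of the non-missing values instead of binding every file into one data.frame. The helper name `running_mean` and the two small demo files are made up for illustration; only the column name `sulfate` comes from the question.

```r
# Hypothetical helper: mean of one column across many CSV files,
# holding only one file in memory at a time.
running_mean <- function(files, column = "sulfate") {
  total <- 0  # running sum of non-missing values
  n <- 0      # running count of non-missing values
  for (f in files) {
    x <- read.csv(f)[[column]]
    x <- x[!is.na(x)]
    total <- total + sum(x)
    n <- n + length(x)
  }
  total / n
}

# Demo on two made-up files in a temporary directory
demo_dir <- file.path(tempdir(), "running_mean_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(sulfate = c(1, NA, 3)),
          file.path(demo_dir, "001.csv"), row.names = FALSE)
write.csv(data.frame(sulfate = c(5, 7)),
          file.path(demo_dir, "002.csv"), row.names = FALSE)

lx <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)
running_mean(lx) # 4, i.e. (1 + 3 + 5 + 7) / 4
```

The result is the same as the mean over the combined data, because a mean over all rows is just the total sum divided by the total count.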

Roman Luštrik

Thank you for your help.

I have sorted it out. The key was to use full.names=TRUE in list.files, so that the returned paths include the directory, and rbind(mydata, ...) inside the loop. Otherwise each file that is read overwrites the previous one instead of being appended to it, and appending is what I was after.

See below. I am not sure it is the most "R" solution, but it works.

pollutantmean <- function(directory, pollutant, id = 1:332) {
  # full.names = TRUE returns paths that include the directory
  files_list <- list.files(directory, full.names = TRUE)
  mydata <- data.frame()
  for (i in id) {
    # append each file's rows instead of overwriting mydata
    mydata <- rbind(mydata, read.csv(files_list[i]))
  }

  if (pollutant == "sulfate") {
    mean(mydata$sulfate, na.rm = TRUE)
  } else if (pollutant == "nitrate") {
    mean(mydata$nitrate, na.rm = TRUE)
  } else {
    "wrong pollutant"
  }
}
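
For what it is worth, a possibly more compact variant is sketched below, following the `lapply` and `do.call(rbind, ...)` idea from the comments and the other answer. The name `pollutantmean2` and the demo files are made up for illustration. Note one deliberate difference: it stops with an error for an unknown pollutant instead of returning the string "wrong pollutant".

```r
# Hypothetical variant: read the selected files with lapply, bind them once,
# and pick the column by name instead of branching on each pollutant.
pollutantmean2 <- function(directory, pollutant, id = 1:332) {
  files_list <- list.files(directory, full.names = TRUE)
  mydata <- do.call(rbind, lapply(files_list[id], read.csv))
  if (!pollutant %in% names(mydata)) {
    stop("wrong pollutant: ", pollutant)
  }
  mean(mydata[[pollutant]], na.rm = TRUE)
}

# Demo on two made-up files in a temporary directory
demo_dir2 <- file.path(tempdir(), "pollutantmean2_demo")
dir.create(demo_dir2, showWarnings = FALSE)
write.csv(data.frame(sulfate = c(1, NA), nitrate = c(2, 4)),
          file.path(demo_dir2, "001.csv"), row.names = FALSE)
write.csv(data.frame(sulfate = 3, nitrate = NA),
          file.path(demo_dir2, "002.csv"), row.names = FALSE)

pollutantmean2(demo_dir2, "sulfate", id = 1:2) # 2, i.e. (1 + 3) / 2
pollutantmean2(demo_dir2, "nitrate", id = 1:2) # 3, i.e. (2 + 4) / 2
```

Binding once at the end avoids growing `mydata` inside the loop, which gets slow as the number of files increases.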


Susan