I have 100 CSV files, and I intend to pick out the sulfate or nitrate column and calculate the sum of its values, as described below.

The CSV format is:

Date   sulfate nitrate ID

1/1/2003    NA  NA  1
1/2/2003    NA  NA  1
1/3/2003    NA  NA  1
1/4/2003    NA  NA  1
1/5/2003    NA  NA  1
1/6/2003    NA  NA  1
1/7/2003    NA  NA  1
1/8/2003    NA  NA  1
1/9/2003    NA  NA  1
1/10/2003   NA  NA  1
1/11/2003   NA  NA  1
1/12/2003   NA  NA  1
1/13/2003   NA  NA  1
1/14/2003   NA  NA  1
1/15/2003   NA  NA  1
1/16/2003   NA  NA  1
1/17/2003   NA  NA  1
1/18/2003   NA  NA  1
1/19/2003   NA  NA  1

All 100 files are in one folder and are named 001.csv, 002.csv, ..., 100.csv.

The ID here is the name of the CSV file. All 100 files have the format shown above.
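
As a small sketch of how numeric IDs map onto those file names (the `id` vector below is made up for illustration), either `sprintf` or `formatC` can zero-pad to three digits. Note that the padding has to be applied to the `id` values themselves; padding `seq_along(id)` would always produce 001, 002, ... regardless of which IDs were requested.

```r
# Hypothetical monitor IDs chosen for illustration
id <- c(1, 23, 100)

# Zero-pad each ID to three digits and append the extension
file_names <- sprintf("%03d.csv", id)

# Equivalent construction with formatC, as used in the question
same_names <- paste0(formatC(as.integer(id), width = 3, flag = "0"), ".csv")

print(file_names)
```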

Here is the code that I have written so far:

pollutantmean <- function(directory,pollutant,id = 1:332)
{
  test<- c('sulfate','nitrate')
  for(i in seq_along(id))
  {
    j<-formatC(i, width=3, flag="0")
    temp<-"C:/Users/Himanshu/Downloads/rprog-data-specdata/"
    temp1<-paste(temp,directory,sep="")
    filepath<- file.path(temp1,paste(j,".csv",sep=""))

    if(test[1]==pollutant)
    {
      data<-read.csv(filepath,header = TRUE, sep = "\t",colClasses=c(NA,"sulfate",NA,NA))
      sum(x=data,na.rm=FALSE)
    }
    else if(test[2]==pollutant)
    {
      data<-read.csv(filepath,header = TRUE, sep = "\t",colClasses=c(NA,NA,"nitrate",NA))
      sum(x=data,na.rm=FALSE)
    }
    data
  }

}

I got the error below when executing this statement at the RStudio console:

data<-read.csv(filepath,header = TRUE, sep = "\t")[,c('nitrate')]

Error --

Error in `[.data.frame`(read.csv(filepath, header = TRUE, sep = "\t"),  : 
  undefined columns selected 

Another way I tried was -

data<-read.csv(filepath,header = TRUE, sep = "\t",colClasses=c(NA,"sulfate",NA,NA))

Error in this case was --

Warning message:
In read.table(file = file, header = header, sep = sep, quote = quote,  :
  cols = 1 != length(data) = 4

This is what the user will type at the R prompt:

pollutantmean("specdata", "nitrate", 1:72)

Here the first argument is the directory, the second is the column name, and the third is the set of CSV file IDs to read.
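
Both messages quoted above would be consistent with the files being comma-separated rather than tab-separated, which is what the `sep = ","` used in the answers suggests. With `sep = "\t"` each line is read as a single column, so there is no `nitrate` column to select and a four-element `colClasses` does not match. Separately, `colClasses` takes class names (or `"NULL"` to skip a column), not column names. A minimal sketch on a throwaway file, with invented sample values:

```r
# Write a tiny comma-separated file with the same header as the question
# (the numeric values are made up for illustration)
tmp <- tempfile(fileext = ".csv")
writeLines(c("Date,sulfate,nitrate,ID",
             "1/1/2003,NA,NA,1",
             "1/2/2003,2.5,0.4,1",
             "1/3/2003,3.5,0.6,1"), tmp)

# With the wrong separator the whole line becomes one column
wrong <- read.csv(tmp, header = TRUE, sep = "\t")
ncol(wrong)

# With the right separator the four named columns appear
right <- read.csv(tmp, header = TRUE, sep = ",")
names(right)

# colClasses expects classes; "NULL" drops a column at read time
only_nitrate <- read.csv(tmp, colClasses = c("NULL", "NULL", "numeric", "NULL"))
nitrate_sum <- sum(only_nitrate$nitrate, na.rm = TRUE)
```

Setting `na.rm = TRUE` matters here: with `na.rm = FALSE`, any `NA` in the column makes the whole sum `NA`.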

2 Answers

pollutantmean <- function(directory, pollutant, id = 1:332) {
  # pollutant must be a character string: "sulfate" or "nitrate"
  # id is numeric and can take values from 1 to 332
  temp <- paste0("C:/Users/Himanshu/Downloads/rprog-data-specdata/", directory)

  total <- 0
  for (i in id) {
    # Build the zero-padded file name from the id itself, e.g. 1 -> "001.csv"
    j <- formatC(i, width = 3, flag = "0")
    filepath <- file.path(temp, paste0(j, ".csv"))
    data <- read.csv(filepath, header = TRUE, sep = ",")
    # Add this file's sum to the running total, ignoring missing values
    total <- total + sum(data[[pollutant]], na.rm = TRUE)
  }
  # Return only after every requested file has been read
  total
}

#check

pollutantmean("specdata", "sulfate", 1:332)
Metrics
  • Code seems to be perfect for 1 set of file. However, my requirement is to calculate the sum across multiple files. So when I put 1:X, where X is the limit of the files to be read. It gives me error: Error in file(file, "rt") : invalid 'description' argument –  Feb 14 '15 at 03:22
  • Do you want to read the sum for each file or sum of all files? – Metrics Feb 14 '15 at 03:24
  • Sum of each along multiple files. Currently sum of one file is being calculated. Similarly sum of one file will be calculated multiple times, which will be the final ans. –  Feb 14 '15 at 03:28
  • Did you make any change in the code? I get the same error. –  Feb 14 '15 at 03:30
  • You don't need to use a `for` loop. Just run `lapply` and then you will get the sum for each file. – Metrics Feb 14 '15 at 03:33
  • Yes I got what you are intending to do. However, my objective was simply to pass - pollutantmean("specdata","sulfate",1:100) and get the sum. The last 2 lines are looking far too complicated than I expected. –  Feb 14 '15 at 03:46
  • And are not intending to do what I intend to achieve. Why not use a loop inside the function, return the sum value in a global variable and keep on doing this till the loop ends? –  Feb 14 '15 at 03:47
  • It should have, but do not know exactly what is wrong with the code. It is picking up just the first file! –  Feb 14 '15 at 04:14
  • You mentioned the error earlier. This may help: http://stackoverflow.com/questions/5568847/how-to-open-csv-file-in-r-when-r-says-no-such-file-or-directory – Metrics Feb 14 '15 at 04:36
  • No there is no error anymore. However, as I mentioned the loop does not seem to iterate... –  Feb 14 '15 at 06:42
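
The `lapply` suggestion in the comments could look roughly like the sketch below. It is not the answerer's code: the helper name `pollutant_sums` is hypothetical, `directory` is taken to be a full path to the folder, and the files are assumed to be comma-separated. It uses `vapply` (a type-checked relative of `lapply`) to get one sum per file, and the per-file sums are then added up for the overall total. The demo files and their values are invented.

```r
# Hypothetical helper: returns one sum per requested file
pollutant_sums <- function(directory, pollutant, id = 1:332) {
  # Use the id values themselves (not their positions) to build file names
  paths <- file.path(directory, sprintf("%03d.csv", id))
  vapply(paths, function(p) {
    data <- read.csv(p, header = TRUE)
    sum(data[[pollutant]], na.rm = TRUE)
  }, numeric(1), USE.NAMES = FALSE)
}

# Demo on two throwaway files with invented values
demo_dir <- file.path(tempdir(), "specdata_sums_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(Date = "1/1/2003", sulfate = c(1, NA), nitrate = c(2, 3), ID = 1),
          file.path(demo_dir, "001.csv"), row.names = FALSE)
write.csv(data.frame(Date = "1/1/2003", sulfate = c(4, 5), nitrate = c(NA, NA), ID = 2),
          file.path(demo_dir, "002.csv"), row.names = FALSE)

per_file <- pollutant_sums(demo_dir, "sulfate", 1:2)  # one value per file
total <- sum(per_file)                                 # sum across all files
```

Because nothing is returned until every file has been processed, this avoids the situation described in the comments where only the first file is picked up.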
  • I believe the if statements are unnecessary.
  • I'm lazy, so I chained things together with the magrittr pipe (%>%),
    which dplyr re-exports.
  • I'm also of the opinion that lapply is the way to go for reading in all those CSVs.

So all this does is: create the list of file names, read all the CSVs into a list, bind them into one data frame, select the requested pollutant column, and take its mean.

I hope this works.

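pollutantmean <- function(directory, pollutant, id = 1:332) {
  library(dplyr)
  # Build zero-padded file names from the requested ids, e.g. 1 -> "001.csv"
  formatC(id, width = 3, flag = "0") %>%
    paste0(".csv") %>%
    file.path("C:", "Users", "Himanshu", "Downloads", "rprog-data-specdata", directory, .) %>%
    # Read every file into a list, then stack the rows into one data frame
    lapply(read.csv, header = TRUE, sep = ",") %>%
    bind_rows() %>%
    # Keep only the requested pollutant column and average it, ignoring NAs
    select(pollutant = contains(pollutant)) %>%
    summarise(mean = mean(pollutant, na.rm = TRUE)) %>%
    .$mean
}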

edit

found typo
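
For readers without dplyr, roughly the same idea (read every file, stack the rows, then take one mean over the pooled column) can be sketched in base R. The function name `pollutant_mean_base` is hypothetical, `directory` is assumed to be a full path to the folder, and the files are assumed to be comma-separated. The demo files and values are invented.

```r
# Hypothetical base-R counterpart of the pipeline above
pollutant_mean_base <- function(directory, pollutant, id = 1:332) {
  paths <- file.path(directory, sprintf("%03d.csv", id))
  # Read each file, then stack all rows into a single data frame
  all_data <- do.call(rbind, lapply(paths, read.csv))
  # Mean over every non-missing value, pooled across files
  mean(all_data[[pollutant]], na.rm = TRUE)
}

# Demo on two throwaway files with invented values
mean_dir <- file.path(tempdir(), "specdata_mean_demo")
dir.create(mean_dir, showWarnings = FALSE)
write.csv(data.frame(Date = "1/1/2003", sulfate = c(1, NA), nitrate = c(2, 3), ID = 1),
          file.path(mean_dir, "001.csv"), row.names = FALSE)
write.csv(data.frame(Date = "1/1/2003", sulfate = c(4, 5), nitrate = c(NA, NA), ID = 2),
          file.path(mean_dir, "002.csv"), row.names = FALSE)

pooled_mean <- pollutant_mean_base(mean_dir, "sulfate", 1:2)
```

A pooled mean like this weights every observation equally, so it can differ from the average of per-file means when files contain different numbers of non-missing values.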

JARS3N