
I have a set of CSV files. Each CSV file has a unique ID, plus other columns such as "date", "sulfate", and "nitrate". The data describe air pollution.

The function must take three arguments: "directory", "pollutant", and "id".

This is the original data format (for the 001.csv file):

Date        Sulfate    Nitrate    ID
2013-02-04  2.27       NA         1
2013-02-05  NA         1.15       1

This is my function so far:

pollutantmean <- function (directory, pollutant, id = 1:332){
  files_full <- list.files (directory, full.names = TRUE)
  dat <- data.frame ()
  for (i in id){
    dat <- rbind (dat, files_full[i])
  }
  datasub <- dat[,pollutant]
  }

1) When users enter this: pollutantmean("specdata", "nitrate", 70:72)

They should get (desired output):
1.706

Instead I get:

Error in `[.data.frame`(dat, , pollutant) : undefined columns selected 

 In addition: Warning messages:
1: In `[<-.factor`(`*tmp*`, ri, value = "specdata/071.csv") :
  invalid factor level, NA generated
2: In `[<-.factor`(`*tmp*`, ri, value = "specdata/072.csv") :
  invalid factor level, NA generated

What do these errors mean?

ogondiaz
  • 1. I think you mean that the user enters `pollutantmean("specdata", "Nitrate", 70:72)` -- proper capitalization of the column name. 2: `70:72` does not make sense with your example data. 3: (last but **not** least) You are taking the average of two columns. Don't you mean to filter by what the user entered for the second parameter? – Matthew Lundberg Jun 16 '14 at 05:06
  • @MatthewLundberg 1. The column name is "nitrate", without capitalization, in the CSV file. 2. The example of formatted data is just for the file 001.csv. The data in 070.csv is formatted like the example, but the ID column says 70. 3. The filter would be the second and third parameters. The mean is of just one column, either "nitrate" or "sulfate", after subsetting by the ID column (i.e. the number of the CSV file). Thanks. – ogondiaz Jun 16 '14 at 05:30
  • You're not actually reading in the data anywhere. You're just putting in the string value of `files_full`; there's no read.table. If this is for the Coursera class, there are hundreds of questions already asked about this simple task. Try searching for `[r] nitrate` or `[r] 332` – MrFlick Jun 16 '14 at 05:42
  • @MrFlick Yes, it's for the Coursera class. My bad, I had forgotten this line: dat <- rbind (dat, read.csv(files_full[i])) Nevertheless, when I type pollutantmean("specdata", "nitrate", 70:72), R does not return any result. Why? I don't get the errors that I got before, either. – ogondiaz Jun 16 '14 at 05:47
  • Well, when you look at your function, what do you think it should return? By default R will return the last object evaluated in the function. Your last statement is an assignment and assignments actually return the right-hand value but do so invisibly (a value is being returned but unless you explicitly print it, you won't see it). You can remove the assignment to just return the pollutant column of dat, or add another line with just `datasub` to return that value. – MrFlick Jun 16 '14 at 06:22
  • I hope this link might help ya : http://stackoverflow.com/questions/16819956/invalid-factor-level-na-generated – heybhai Jun 16 '14 at 06:28
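
The point about invisible return values can be seen in a small sketch. The function names `f_assign` and `f_value` are made up for illustration and are not part of the code under discussion.

```r
# Hypothetical functions, for illustration only.
f_assign <- function(x) {
  result <- mean(x)   # last expression is an assignment: the value is returned invisibly
}

f_value <- function(x) {
  result <- mean(x)
  result              # last expression is the value itself: it is returned visibly
}

f_assign(c(1, 2, 3))          # prints nothing at the console
print(f_assign(c(1, 2, 3)))   # the value is there once it is printed explicitly
f_value(c(1, 2, 3))           # auto-prints at the console
```

Both functions return the same number; only the visibility of the result differs, which is why ending a function with an assignment makes it look as if nothing was returned.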

1 Answer


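I found two problems with your code:

  1. list.files only lists the file names. It does not read the files into the workspace.

  2. When subsetting a data.frame, the column name must be a character string, e.g. dat[,"column_name"]. The pollutant argument already holds the name as a string, so use the variable itself: dat[,pollutant].

I have modified the function for you; try this.

    pollutantmean <- function (directory, pollutant, id = 1:332){
      files_full <- list.files (directory, full.names = TRUE)
      dat <- data.frame()
      for (i in id){
        # "......." is a placeholder, not literal R code: set sep and any
        # further read.table arguments to match your files
        dat <- rbind (dat, read.table(files_full[i],sep="",.......))
      }
      # pollutant already holds the column name as a string
      datasub <- dat[,pollutant]
      # return the mean, ignoring missing values
      mean(datasub, na.rm = TRUE)
    }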
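
As a rough, runnable sketch of the same idea: the version below assumes the files are comma-separated with a header row, so it uses `read.csv`, and it swaps the `rbind` loop for `lapply` plus `do.call(rbind, ...)`. The directory `demo_dir` and the two small files are made-up data for demonstration, not the real specdata files.

```r
# Sketch only: read.csv() assumes comma-separated files with a header row.
pollutantmean <- function(directory, pollutant, id = 1:332) {
  files_full <- list.files(directory, full.names = TRUE)
  # Read each requested file into a data frame, then stack them row-wise.
  dat <- do.call(rbind, lapply(files_full[id], read.csv))
  # `pollutant` holds the column name as a string, so the variable is used unquoted.
  mean(dat[, pollutant], na.rm = TRUE)
}

# Made-up demonstration data (hypothetical values, not the real data set).
demo_dir <- file.path(tempdir(), "specdata_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(Date = c("2013-02-04", "2013-02-05"),
                     sulfate = c(2.27, NA), nitrate = c(NA, 1.15), ID = 1),
          file.path(demo_dir, "001.csv"), row.names = FALSE)
write.csv(data.frame(Date = c("2013-02-04", "2013-02-05"),
                     sulfate = c(1.00, 3.00), nitrate = c(0.85, NA), ID = 2),
          file.path(demo_dir, "002.csv"), row.names = FALSE)

# Mean of the non-missing nitrate values across both demo files.
pollutantmean(demo_dir, "nitrate", 1:2)
```

Indexing `files_full[id]` relies on `list.files` returning the files in sorted order, so that position 70 corresponds to 070.csv; that holds only if the directory contains nothing but the zero-padded data files.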

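A faster way is to use the data.table package. Be aware that inside a data.table's [ ] an unquoted name refers to a column, e.g. dat[,column_name]. Because pollutant is a variable holding the column name, extract the column with dat[[pollutant]] instead.

    pollutantmean <- function (directory, pollutant, id = 1:332){
      library(data.table)

      files_full <- list.files (directory, full.names = TRUE)
      dat_list <- list()
      for (i in id){
        # "......." is a placeholder, not literal R code: set sep and any
        # further fread arguments to match your files
        dat_list[[i]] <- fread(files_full[i],sep="",.......)
      }
      dat <- rbindlist(dat_list)
      # pollutant holds the column name as a string, so use [[ ]] to extract it
      return(mean(dat[[pollutant]], na.rm = TRUE))
    }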
Koundy
  • Thanks so much! But I get this error with your code: Error in read.table(files_full[i], sep = "", .......) : object '.......' not found. Why is that? Please explain. This happens when entering: pollutantmean("specdata", "sulfate", 1:40) – ogondiaz Jun 16 '14 at 18:55
  • Hi, by "...." I mean "and so on", i.e. you can add whatever other arguments you want. It doesn't mean you should literally write ..... Hope that makes sense. – Koundy Jun 17 '14 at 04:00
  • @user2743244... I have this code: 'pollutantmean <- function (directory, pollutant, id = 1:332){ files_full <- list.files (directory, full.names = TRUE) dat <- data.frame() for (i in id){ dat <- rbind (dat, read.table(files_full[i])) } datasub <- dat[,"pollutant"] }' But I still get this error: Error in `[.data.frame`(dat, , "pollutant") : undefined columns selected. **Notice that I enter this to get the results: pollutantmean("specdata", "nitrate", 70:72) – ogondiaz Jun 17 '14 at 17:24
  • Add the argument header=TRUE to the read.table command. If you are working with .csv files, you also have to add sep = ",". All these are very basic things. Read the help for the read.table function. – Koundy Jun 18 '14 at 05:09
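
To see what the `header` and `sep` advice changes, here is a small sketch on a made-up two-row CSV (the file and its values are hypothetical, modelled on the sample rows in the question). It also shows a second likely cause of the "undefined columns selected" error: quoting `"pollutant"` asks for a column literally named pollutant.

```r
# Made-up two-row CSV, for illustration only.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Date,sulfate,nitrate,ID",
             "2013-02-04,2.27,NA,1",
             "2013-02-05,NA,1.15,1"), csv_path)

# Defaults: whitespace is the separator, so each whole line becomes one field.
raw <- read.table(csv_path)
names(raw)    # a single auto-named column, so "nitrate" cannot be selected

# With header = TRUE and sep = ",", the first line supplies the column names.
parsed <- read.table(csv_path, header = TRUE, sep = ",")
names(parsed)

# The variable holds the column name; the quoted literal does not.
pollutant <- "nitrate"
mean(parsed[, pollutant], na.rm = TRUE)        # uses the nitrate column
bad <- try(parsed[, "pollutant"], silent = TRUE)  # no column is named "pollutant"
inherits(bad, "try-error")
```

For files that are known to be comma-separated with a header row, `read.csv` applies both settings by default.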