
I'm trying to write a simple function that filters my data frame and calculates the mean of either ozone or PM for the rows where the site ID has a given value. The data looks like this:

> dput(head(df))
structure(list(ozone = c(NA_real_, NA_real_, NA_real_, NA_real_, 
NA_real_, NA_real_), pm = c(NA_real_, NA_real_, NA_real_, NA_real_, 
NA_real_, NA_real_), site.id = c(1, 1, 1, 1, 1, 1)), row.names = c(NA, 
6L), class = "data.frame")

My code is the following:

 function1<-function(data, air_pollutant, site_id) 
  {
  first_step<-subset(data, site_id)
  pollution<-mean(first_step$air_pollutant, na.rm=TRUE)
  pollution
  }

However, when I try the following:

function1(dat_csv, ozone, 1:115) 

It returns NA with the following warning:

2: In mean.default(mean$air_pollutant, na.rm = TRUE): 
    argument is not numeric or logical: returning NA
M--
  • Please share your input data by adding the output of `dput(my_input_data)` to the question, and make sure it is possible to reproduce the error you get on other machines using only the code and data provided in the question. – IceCreamToucan Sep 18 '19 at 15:34
  • `?subset` reads, in part, "This is a convenience function intended for use interactively. For programming it is better to use the standard subsetting functions like [, and in particular the non-standard evaluation of argument subset can have unanticipated consequences." I don't know if that is your problem, but it is a bit of a red flag to be using `subset` inside of a function definition like that. – John Coleman Sep 18 '19 at 15:34
  • I wouldn't recommend naming a variable `mean` – Jonny Phelps Sep 18 '19 at 15:35
  • @IceCreamToucan what do you mean by input data? I've given the head of the data, but cannot upload all of it as it has 1000s data points. Sorry if I misunderstood – polarsandwich Sep 18 '19 at 15:41
  • Possible duplicate of [Dynamically select data frame columns using $ and a vector of column names](https://stackoverflow.com/questions/18222286/dynamically-select-data-frame-columns-using-and-a-vector-of-column-names) – divibisan Sep 18 '19 at 15:45
  • You can't select a variable by name using a string with `$`, you need to use square bracket notation `[` – divibisan Sep 18 '19 at 15:46
  • I added the data using dput(head(data)), is it ok? – polarsandwich Sep 18 '19 at 15:48

1 Answer

Valid points in the comments above. Also, pass the air pollutant as a character string when calling the function. I modified your function to make it work:

df <- data.frame(year = c(2010, 2010, 2013),
           ozone = c(34,55,112),
           pm = c(2,2,3),
           site_id = c(1,1,2))

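function1 <- function(data, air_pollutant, site_id) 
{
  # keep the rows whose site_id is one of the requested values (all columns)
  ss <- data[data$site_id %in% site_id, ]
  # air_pollutant is a column name given as a string, so use [[ ]] rather than $
  pollution <- mean(ss[[air_pollutant]], na.rm = TRUE)
  pollution
}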

function1(df, "ozone", 1:115)
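
To see why the original version returned NA, here is a small sketch on the same toy data (minus the `year` column). The names `by_dollar` and `by_brackets` are only for illustration. `$` takes whatever follows it as a literal column name, so `df$air_pollutant` looks for a column called "air_pollutant" and gets `NULL`, whereas `[[` evaluates the variable first and looks up the string it holds:

```r
df <- data.frame(ozone = c(34, 55, 112),
                 pm = c(2, 2, 3),
                 site_id = c(1, 1, 2))

air_pollutant <- "ozone"

# `$` uses the name literally: there is no column called "air_pollutant"
by_dollar <- df$air_pollutant
is.null(by_dollar)                 # TRUE

# `[[` evaluates the variable, so this extracts the ozone column
by_brackets <- df[[air_pollutant]]
mean(by_brackets, na.rm = TRUE)    # 67
```

Calling `mean()` on the `NULL` from the first lookup is what produces the "argument is not numeric or logical" warning.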
slava-kohut
  • Amazing, this works! Thank you so much. Could I just ask so that I know for the future, why did you use the 'data$site_id %in% site_id'? Is it to let the function know that the values for the site_id are located within the data frame? Also, why space at the end? Thanks a lot! – polarsandwich Sep 18 '19 at 15:53
  • @polarsandwich `data$site_id %in% site_id` returns a logical vector that is `TRUE` for each row whose `site_id` is one of the values passed to the function; indexing `data` with it keeps only those rows. `data$site_id` extracts the `site_id` column of the data frame. The space at the end is not required, you can remove it if you like; it just makes the code more readable. The comma before it is needed, though: leaving the column index empty keeps all columns. – slava-kohut Sep 18 '19 at 16:41
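
For anyone who wants to see the row filter step by step, here is a minimal sketch on the toy data from the answer (without the `year` column). The variable name `keep` and the choice of site 1 are just for illustration:

```r
df <- data.frame(ozone = c(34, 55, 112),
                 pm = c(2, 2, 3),
                 site_id = c(1, 1, 2))

# one TRUE/FALSE per row: is this row's site_id among the wanted values?
keep <- df$site_id %in% c(1)
keep                                  # TRUE TRUE FALSE

# rows where keep is TRUE; the empty slot after the comma means "all columns"
ss <- df[keep, ]
mean(ss[["ozone"]], na.rm = TRUE)     # 44.5
```

With `1:115` as in the question, every row of this toy data frame matches, so the mean is taken over all three ozone values.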