
Edit: I have removed my earlier text to make room for my updated attempts, in case you can help me out.

I want to write a for loop that goes through all 332 cases in the directory, picks out the nitrate or sulphate values, and takes the mean of those values.

I have figured out how to do this for an individual case, but repeating it for every case would take a lot of writing. How can I implement this as a for loop? Please just point me in the right direction without giving the full answer.

specdata <- list.files(getwd(), pattern="*.csv")
directory <- lapply(specdata, read.csv)
name_1 <- get("nitrate", envir = as.environment(directory[[1]]))
name_2 <- na.omit(name_1)
name_3 <- name_2[1:122]

pollutantmean <- function(directory, pollutant, id = 1:332) {
for( ?) {
   ???
}
??????
      }

I have since tried a different method. I removed the columns I did not need (Sulphate and Date), leaving only nitrate and ID. I then omitted the NA values, so ID now labels each remaining nitrate value across the 332 cases. My next step is working out how to select by ID value rather than by row. For example, print(final_df$ID[1:32]) returns the ID values of the first 32 rows rather than the first 32 cases, i.e. 1, 2, 3 ... 32. (It returns 1, 1, 1 ... 1, because the list is long: roughly the first 1000 rows are 1s, the next 1000 are 2s and so forth; these counts are not exact.)

By doing so, I could then select the numeric nitrate values for each integer ID value and find the mean of those values. How would I go about doing this?

The data is something like this

Date      Sulphate  Nitrate  ID
10/10/10   0.576     0.784    1
10/10/10   0.738     0.687    1
   .         .         .      .
   .         .         .      .
11/11/11   0.954     1.093    2
   .         .         .      .
   .         .         .      .
   .         .         .      .
13/13/13   0.495     0.586   332

final_df$date <- NULL
final_df$Sulphate <- NULL

So far the code looks like this

library(dplyr)  # provides select() and filter()
specdata <- list.files(getwd(), pattern="*.csv")
directory <- lapply(specdata, read.csv)
directory_final <- do.call(rbind, directory)

# nitrate: keep the columns, drop NAs, pick the IDs, take the mean
one <- select(directory_final, nitrate:ID)
two <- na.omit(one)
three <- filter(two, ID %in% 1:30)
four <- mean(three$nitrate)

# sulfate: same four steps
a <- select(directory_final, sulfate, ID)
b <- na.omit(a)
c <- filter(b, ID %in% 1:30)
d <- mean(c$sulfate)

It works in that it extracts the values I need; however, it is very impractical in the long run. I have had to write 8 lines of code to retrieve the mean of either sulfate or nitrate for a set of ID values. If I want a different set of IDs, I have to go back to three and c, change the values, and then rerun four and d. I will be working on how to combine these into a single piece of code that returns the mean for a chosen set of ID values for either sulfate or nitrate. I expect that I will need to write a function, so any tips are appreciated!

Lime
  • Use `get` or `mget` with `lapply` for instance. **I am not looking for actual answers** – NelsonGon Jul 05 '19 at 11:22
  • How do you use *pollutant* and why re-assign *Directory* to itself? Are you trying to get the mean from each data frame in *Directory* list separately or mean across all dfs in list? – Parfait Jul 05 '19 at 12:12
  • @Parfait The code is essentially my prototype, and I have looked into it again since your question. You are right: the Directory is already there, so there is no need to reassign it inside the function. I will use pollutant to find either "nitrate" or "Sulphate" in the Directory. I have changed directory using this code ```get("*.csv", envir = as.environment(Specdata)); Directory <- lapply(Specdata, read.csv)``` I am trying to find the mean value of pollutant (nitrate or sulphate) across a range of the directory, i.e. 1:33, 5:102 ... anywhere between 1 and 332, as that is the count of .csv files. – Lime Jul 05 '19 at 13:32
  • Please [show a sample of the data](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). Are *Nitrate* and *Sulfate* in their own columns, or are they indicators in a *Pollutant* column with values in a separate numeric column? Do all dfs in *Directory* have exactly the same columns, in the same number? – Parfait Jul 05 '19 at 15:42
  • Well, what can I say, I find the link very difficult to understand. I will have a go at explaining the dataset. Directory is a list of length 332, one element per case. Each element has 4 columns: sulphate, nitrate, date and ID. The number of rows varies between elements, while the columns stay the same. I want to find the mean value of nitrate across a range of elements in directory, which could mean 1:12, or 24:55, or 100:300. How do I write code that obtains the mean nitrate value across many elements? – Lime Jul 05 '19 at 16:24
  • My next problem is keeping the ID values aligned with nitrate once the NAs are removed from nitrate, since ID still counts the rows where nitrate is NA. I am currently finding it difficult to remove the NAs so that ID only counts the numeric values that are not NA. – Lime Jul 07 '19 at 10:31

1 Answer


Simply concatenate your list of data frames and then take the needed means of the columns. Consider also tapply (a sibling of lapply) to calculate means by case number or ID.

# RETRIEVE ALL CSVs IN WORK DIRECTORY
specdata <- list.files(getwd(), pattern="*.csv")

# BUILD LIST OF DATA FRAMES
df_list <- lapply(seq_along(specdata), function(i)  
       transform(read.csv(specdata[i]), case_no = i))

# COMBINE ALL DFs INTO SINGLE, LONG DF
final_df <- do.call(rbind, df_list)

# CALCULATE MEANS BY 332 CASE NUMBERS (na.rm = TRUE SKIPS MISSING VALUES)
nitrate_mean_case_vector <- with(final_df, tapply(Nitrate, case_no, mean, na.rm = TRUE))
sulfate_mean_case_vector <- with(final_df, tapply(Sulfate, case_no, mean, na.rm = TRUE))

# CALCULATE MEANS OF THE FIRST 20 ROWS IN EACH CASE
nitrate_mean_id_vector <- with(final_df, tapply(Nitrate, case_no, 
                                   function(x) mean(head(x, 20), na.rm = TRUE)))
sulfate_mean_id_vector <- with(final_df, tapply(Sulfate, case_no, 
                                   function(x) mean(head(x, 20), na.rm = TRUE)))
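
To average over a range of cases (say 20:60) rather than all 332, one option is to filter the combined data frame on the case column before taking the mean. Below is a minimal sketch on made-up data, assuming the `Nitrate` and `case_no` column names used above; `range_mean` is a hypothetical helper written for illustration, not a function from any package.

```r
# Toy stand-in for final_df: three cases with differing row counts and some NAs
final_df <- data.frame(
  Nitrate = c(1, NA, 3, 5, 4, 6, NA, 10),
  case_no = c(1, 1, 1, 1, 2, 2, 3, 3)
)

# Per-case means; na.rm = TRUE drops missing readings instead of returning NA
case_means <- with(final_df, tapply(Nitrate, case_no, mean, na.rm = TRUE))

# Hypothetical helper: pooled mean of one column over the chosen cases
range_mean <- function(df, column, cases) {
  values <- df[[column]][df$case_no %in% cases]  # rows belonging to those cases
  mean(values, na.rm = TRUE)
}

pooled_1_2 <- range_mean(final_df, "Nitrate", 1:2)
```

Note that the pooled mean over cases 1:2 weights every row equally, so it can differ from the average of the two per-case means when the cases have different numbers of rows.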
Parfait
  • It only works for getting the mean of the entire list of 332 cases, which is not what I am looking for. If I wanted the mean of the first 20, or 20:60, or 100:150 cases, I would not be able to do it: the combined list becomes 772087 rows, and since each case has its own number of rows, I would not know how many rows to take to get the mean nitrate for each case. That is why a function would work: I could then call pollutantmean(directory, pollutant = "nitrate", 1:332 or 1:20 etc.). I just find it difficult to write the for loop to achieve this. – Lime Jul 06 '19 at 09:44
  • ```names(directory[[1]])``` returns "Date", "Sulphate", "Nitrate", "ID". I figured I need a way to loop over ID for every case and then calculate the mean of the pollutant from those cases, whether individually or combined. I am stuck on getting the ID loop to work. – Lime Jul 06 '19 at 09:46
  • See update that adds a column for case number, 1-332, then runs means by each case with `tapply` to return a named, numeric vector. You can also index the final columns: `mean(final_df$Nitrate[1:20])`. – Parfait Jul 06 '19 at 14:07
  • Could you explain why you made another column, case_no, when ID works the same? I have tried the code; it returns NA for all 332 cases, and the last expression only returns NULL, i.e. ```mean(final_df$nitrate[1:20])```, so it still cannot seem to extract nitrate from the column. This may be an error caused by the NA values. When calculating means by each case, the final argument '20' seems to make no difference to the array. I could try omitting the NAs; however, I am not yet at the level to incorporate that into your code. – Lime Jul 07 '19 at 09:45
  • I had an idea: extract nitrate from final_df with the method I used above, do the same with ID or case_no, then combine the two and take the mean from that. I am not sure how practical this will turn out, but I will give it a go. It would be something like turning the two into vectors, then a matrix, then back into list form, and extracting the mean values from that. My problem with this is that nitrate is numeric and ID is an integer. If I can combine them and put them into list form, I reckon I could then extract the mean of nitrate using ID. – Lime Jul 07 '19 at 10:12
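
The loop-based approach discussed in these comments (read only the selected files, pool the pollutant values, drop the NAs once, then average) could be sketched roughly as follows. This is only an illustration under stated assumptions: the lowercase column names `sulfate` and `nitrate`, the files sorting in ID order, and the two demo files are all made up here, since the real column names and file names are not shown consistently above.

```r
# Hypothetical sketch of pollutantmean(); assumes each CSV has columns
# Date, sulfate, nitrate, ID and that the files sort in ID order.
pollutantmean <- function(directory, pollutant, id = 1:332) {
  files <- list.files(directory, pattern = "\\.csv$", full.names = TRUE)
  values <- numeric(0)
  for (i in id) {
    monitor <- read.csv(files[i])
    values <- c(values, monitor[[pollutant]])  # pool readings across files
  }
  mean(values, na.rm = TRUE)  # drop NAs once, after pooling
}

# Demo on two throwaway files written to a temporary directory
demo_dir <- file.path(tempdir(), "specdata_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(Date = "2010-01-01", sulfate = c(1, NA), nitrate = c(2, 4), ID = 1),
          file.path(demo_dir, "001.csv"), row.names = FALSE)
write.csv(data.frame(Date = "2010-01-02", sulfate = c(3, 5), nitrate = c(NA, 6), ID = 2),
          file.path(demo_dir, "002.csv"), row.names = FALSE)

demo_nitrate <- pollutantmean(demo_dir, "nitrate", id = 1:2)
demo_sulfate <- pollutantmean(demo_dir, "sulfate", id = 2)
```

Selecting the column with `monitor[[pollutant]]` lets the same function serve both pollutants from a string argument, and looping over `id` means only the requested files are read.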