I have a dataframe that looks like this:
set.seed(300)
df <- data.frame(site = sort(rep(paste0("site", 1:5), 5)),
                 value = sample(c(1:5, NA), replace = TRUE, 25))
df
site value
1 site1 NA
2 site1 5
3 site1 5
4 site1 5
5 site1 5
6 site2 1
7 site2 5
8 site2 3
9 site2 3
10 site2 NA
11 site3 NA
12 site3 2
13 site3 5
14 site3 4
15 site3 4
16 site4 NA
17 site4 NA
18 site4 4
19 site4 4
20 site4 4
21 site5 NA
22 site5 3
23 site5 3
24 site5 1
25 site5 1
As you can see, there are several missing values in the value column. I need to replace each missing value with the mean of value for its site. So if value is missing for a row measured at site1, I need to impute the mean value for site1.
However, the dataframe is constantly being added to and re-imported into R. The next time I import it, it will likely have grown to something like 50 rows, with many more missing values in value. I need a function that automatically detects which site each missing value was measured at and imputes the mean for that particular site. Could anybody help me with this?
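To make the request concrete, here is a rough base R sketch of the kind of function described above. It is only one possible approach, not a definitive solution. The function name impute_site_mean and its arguments group_col and value_col are hypothetical names chosen for illustration. It assumes the grouping column has no missing entries, and that a site with at least one observed value should get that site's mean; a site whose values are all missing would end up with NaN.

```r
# Hypothetical helper: replace each NA in the value column with the mean
# of the non-missing values recorded at the same site.
impute_site_mean <- function(df, group_col = "site", value_col = "value") {
  # Convert to numeric first so that non-integer means can be stored.
  values <- as.numeric(df[[value_col]])

  # ave() applies FUN within each group and returns a vector aligned with
  # the original row order, so it works however many rows or sites exist.
  df[[value_col]] <- ave(
    values,
    df[[group_col]],
    FUN = function(x) {
      # If every value in the group is NA, mean() returns NaN here.
      x[is.na(x)] <- mean(x, na.rm = TRUE)
      x
    }
  )
  df
}

# Small made-up example (not the data from the question):
toy <- data.frame(site  = c("a", "a", "a", "b", "b"),
                  value = c(1, NA, 3, NA, 4))
imputed <- impute_site_mean(toy)
imputed$value
# 1 2 3 4 4
```

Because the grouping is computed from whatever rows are present, the same call should work unchanged after each re-import, e.g. df <- impute_site_mean(df).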