1

I have a dataframe in R with this structure:

df1<-data.frame(SiteID=c("A","A","A","B","B","C"),Unrelated=c("dog","cat","catamount","bird","horse","monkey"),AirMonitor=c(1,0,0,0,0,1),WaterMonitor=c(0,1,0,1,0,0),SoilMonitor=c(0,0,1,0,1,0))

with output like this:

  SiteID Unrelated AirMonitor WaterMonitor SoilMonitor
1      A       dog          1            0           0
2      A       cat          0            1           0
3      A catamount          0            0           1
4      B      bird          0            1           0
5      B     horse          0            0           1
6      C    monkey          1            0           0

1 indicates the presence of that type of monitor at that site (I could change the monitor columns to factors with "Y" and "N" values instead if needed). Basically, I want to aggregate monitor presence by SiteID but then keep all the rows in order not to lose the data in the Unrelated column. I want 1 values to win over 0 values. For example, if any row with SiteID A has a 1 for AirMonitor, I want all rows with SiteID A to have a 1 for AirMonitor. Same for the other two monitor types.

Desired output:

  SiteID Unrelated AirMonitor WaterMonitor SoilMonitor
1      A       dog          1            1           1
2      A       cat          1            1           1
3      A catamount          1            1           1
4      B      bird          0            1           1
5      B     horse          0            1           1
6      C    monkey          1            0           0

The real dataset has several more unrelated columns and thousands of rows. There must be some simple way to do this (maybe with aggregate?).

d.b
  • 32,245
  • 6
  • 36
  • 77

2 Answers2

3

Usually, if you want to aggregate and keep the rows intact, ave can help. So, we use lapply to loop through the columns (except the first two) and use ave to calculate max value by SiteID

df1[,-c(1:2)] = lapply(df1[,-c(1:2)], function(a) ave(a, df1$SiteID, FUN = max))
df1
#  SiteID Unrelated AirMonitor WaterMonitor SoilMonitor
#1      A       dog          1            1           1
#2      A       cat          1            1           1
#3      A catamount          1            1           1
#4      B      bird          0            1           1
#5      B     horse          0            1           1
#6      C    monkey          1            0           0
d.b
  • 32,245
  • 6
  • 36
  • 77
  • 1
    Thank you! I replaced `-c(1:2)` wherever it occurred with `c("AirMonitor","WaterMonitor","SoilMonitor")` because I had a lot of unrelated columns to preserve, and it still worked. – rilkehayden Aug 13 '17 at 22:35
  • 1
    I'd probably use `lapply` just to avoid the simplification-to-matrix process that `sapply` does. – thelatemail Aug 13 '17 at 22:55
2

Using dplyr:

df1 %>% group_by(SiteID) %>% mutate_at(vars(-Unrelated), funs(max)) 
Source: local data frame [6 x 5]
Groups: SiteID [3]

  SiteID Unrelated AirMonitor WaterMonitor SoilMonitor
  <fctr>    <fctr>      <dbl>        <dbl>       <dbl>
1      A       dog          1            1           1
2      A       cat          1            1           1
3      A catamount          1            1           1
4      B      bird          0            1           1
5      B     horse          0            1           1
6      C    monkey          1            0           0

Alternatively, if you have multiple variables like Unrelated and don't want to specify them all the time, you could think of something like

df %>% group_by(SiteID) %>% mutate_if(is.numeric, funs(max))

This applies max to each numeric column on a per-group basis.

coffeinjunky
  • 11,254
  • 39
  • 57