
I have multiple species presence observations per location, split by collection time, but I would like to know whether each species appeared at that location at any time. My data currently looks like this:

### Location   Collection_time   Species   Presence
#    loc1        6-8PM             Sp1        Y
#    loc1        6-8PM             Sp2        N
#    loc1        8-10PM            Sp1        N
#    loc1        8-10PM            Sp2        Y
#    loc1        10-12PM           Sp1        N
#    loc1        10-12PM           Sp2        N
#    loc2        6-8PM             Sp1        Y
#    loc2        6-8PM             Sp2        N
#    loc2        8-10PM            Sp1        N
#    loc2        8-10PM            Sp2        N
#    loc2        10-12PM           Sp1        N
#    loc2        10-12PM           Sp2        N

What I would like is a new data frame with one presence/absence value per location and species, regardless of collection time, like this:

### Location   Species   Presence
#    loc1        Sp1        Y
#    loc1        Sp2        Y
#    loc2        Sp1        Y
#    loc2        Sp2        N

New to R and I don't have a strong enough grasp on it to work out how to achieve this yet, so stuck before the stage where I have reasonably lucid attempts at code. Thanks in advance for help!

westpier

2 Answers


A base R solution

aggregate(Presence ~ Location + Species, df, max, na.rm = TRUE)

#   Location Species Presence
# 1     loc1     Sp1        Y
# 2     loc2     Sp1        Y
# 3     loc1     Sp2        Y
# 4     loc2     Sp2        N

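This works because max() compares character strings lexicographically: "Y" sorts after "N", so max("Y", "N") returns "Y". A group therefore gets "Y" as soon as any of its observations is "Y".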
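The comparison can be checked directly. The following is a minimal sketch, not part of the original answer: the data frame is rebuilt by hand from the table in the question, and it assumes that Presence is stored as a character column (the question does not show how the data were read in).

```r
# Rebuild the example data from the question.
# Assumption: all columns are plain character vectors, not factors.
df <- data.frame(
  Location = rep(c("loc1", "loc2"), each = 6),
  Collection_time = rep(rep(c("6-8PM", "8-10PM", "10-12PM"), each = 2), times = 2),
  Species = rep(c("Sp1", "Sp2"), times = 6),
  Presence = c("Y", "N", "N", "Y", "N", "N",
               "Y", "N", "N", "N", "N", "N"),
  stringsAsFactors = FALSE
)

# Strings are compared lexicographically, so "Y" sorts after "N".
y_after_n <- "Y" > "N"
largest <- max(c("N", "Y", "N"))

# Collapse over Collection_time: one row per Location/Species pair.
res <- aggregate(Presence ~ Location + Species, data = df, FUN = max)
print(res)
```

If Presence is a factor instead (the default when reading data in R versions before 4.0.0), max() stops with an error for unordered factors, so converting the column first with as.character() would be needed.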

Darren Tsai
  • Thank you! Really nice one-line solution. In the interest of better understanding, why is Y encoded as greater than N? Is it an alphabetical value thing? – westpier Jul 17 '20 at 12:23
  • @westpier Take a look at https://stackoverflow.com/questions/37914917/using-max-function-on-character-vectors-in-r and https://stat.ethz.ch/R-manual/R-devel/library/base/html/Extremes.html : "[...] Character versions are sorted lexicographically, [...]" – Martin Gal Jul 17 '20 at 12:46

You could use dplyr, assuming your data is stored in a data.frame named df:

library(dplyr)

df %>%
  group_by(Location, Species) %>%
  summarise(Presence = ifelse(max(Presence == "Y") == 1, "Y", "N"))

returns

  Location Species Presence
  <chr>    <chr>   <chr>   
1 loc1     Sp1     Y       
2 loc1     Sp2     Y       
3 loc2     Sp1     Y       
4 loc2     Sp2     N   
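The same grouping can also be written with any(), which states the intent ("was the species seen in at least one time slot?") a little more directly. This is only a sketch of an alternative, not the answer's own code; the data frame is rebuilt by hand from the question's table and Presence is assumed to be a character column.

```r
library(dplyr)

# Example data rebuilt from the question (assumed character columns).
df <- data.frame(
  Location = rep(c("loc1", "loc2"), each = 6),
  Collection_time = rep(rep(c("6-8PM", "8-10PM", "10-12PM"), each = 2), times = 2),
  Species = rep(c("Sp1", "Sp2"), times = 6),
  Presence = c("Y", "N", "N", "Y", "N", "N",
               "Y", "N", "N", "N", "N", "N"),
  stringsAsFactors = FALSE
)

# "Y" if the species was recorded in any collection window, otherwise "N".
res <- df %>%
  group_by(Location, Species) %>%
  summarise(Presence = if (any(Presence == "Y")) "Y" else "N")

print(res)
```

Unlike the max() comparison, any() returns TRUE as soon as one "Y" is present even when other values in the group are NA.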
Martin Gal
  • Thank you! I had a feeling there would be a solution with group_by and ifelse, but I really need to get a better grasp of the syntax! Really appreciated. – westpier Jul 17 '20 at 10:32
  • Mind-blowing fact: as Darren Tsai pointed out, you can replace that whole `ifelse()` part with `max(Presence)`. – Martin Gal Jul 17 '20 at 11:19