0

I have a data set with the following columns:

locID       = the location of ID of the observer
yr          = the year of the observation in categorical format: P_year
maxFlock    = a number counted by the observer
lat         = latitude of the location
long        = longitude of the location
state       = US state of the observation
effortDays  = categorical, I, II, III, and IV
effortHours = categorical, A, B, C, D 

Here is a sample of the data frame:

PData

  locID    yr  maxFlock     lat   long state effortDays effortHours 
L4278   P_2000        3   41.42 -73.67    NY        II           C
L4278   P_2000        6   41.42 -73.67    NY       III           C
L4278   P_2000        4   41.42 -73.67    NY       III           C
L4278   P_2012        2   41.42 -73.67    NY       III           B
L4278   P_2012        4   41.42 -73.67    NY        IV           B
L4278   P_2012        8   41.42 -73.67    NY        IV           B
L10494  P_2003        4   42.01 -77.44    NY        IV           C
L10494  P_2003        0   42.01 -77.44    NY        IV           C
L10494  P_2003        8   42.01 -77.44    NY        IV           D
L10494  P_2005        4   42.01 -77.44    NY        IV           C
L10494  P_2005        6   42.01 -77.44    NY        IV           C
L10494  P_2009        8   42.01 -77.44    NY        IV           C

I want to make a new column (labelled: xmf) that computes the average of maxFlock. The average, though, has to be computed for each unique combination of locID, yr, effortDays, and effortHours . If I were to run the code on the above sample, the final product would look like this.

PData

  locID     yr maxFlock         xmf        lat    long  state effortDays effortHours 
L4278   P_2000        3           3       41.42 -73.67    NY        II           C
L4278   P_2000        6           5       41.42 -73.67    NY       III           C
L4278   P_2000        4           5       41.42 -73.67    NY       III           C
L4278   P_2012        2           2       41.42 -73.67    NY       III           B
L4278   P_2012        4           6       41.42 -73.67    NY        IV           B
L4278   P_2012        8           6       41.42 -73.67    NY        IV           B
L10494  P_2003        4           2       42.01 -77.44    NY        IV           C
L10494  P_2003        0           2       42.01 -77.44    NY        IV           C
L10494  P_2003        8           8       42.01 -77.44    NY        IV           D
L10494  P_2005        4           5       42.01 -77.44    NY        IV           C
L10494  P_2005        6           5       42.01 -77.44    NY        IV           C
L10494  P_2009        8           8       42.01 -77.44    NY        IV           C

I originally tried to do this using:

PData$xmf = ave(myData2$maxFlock, myData2$locID, myData2$yr, myData2$effortDays, myData2$effortHours)

But it didn't work (had to kill it after waiting over a half hour), and I'm also not even sure if ave() can do what I want it to do.

I was thinking about trying something with the split-apply-combine methodology, but I don't think that's exactly what I'm looking for, because I would have to subset for locID, then subset for year, then for effortHours OR effortDays, and I don't want to have to make that choice. I want to do it by unique combinations.

It would also be great if there were a fast way to do this. The data I'm working with is about 2.5 million rows, so if-statements inside of for-loops are definitely not ideal.

Thank you!

Heliornis
  • 391
  • 5
  • 18
  • Perhaps using the package `data.table` with something like this: `library(data.table) setDT(PData) PData[, xmf = ave(maxFlock), by = .(locID, yr, effortDays, effortHours)] `. Some more adjustments to get your wanted result should be easy. – R Yoda Sep 23 '17 at 17:55

3 Answers3

1

A solution from dplyr.

library(dplyr)

PData <- PData %>%
  group_by(locID, yr, effortDays, effortHours) %>%
  mutate(xmf = mean(maxFlock)) %>%
  select(c(1:3, 9, 4:8))
PData
# A tibble: 12 x 9
# Groups:   locID, yr, effortDays, effortHours [8]
    locID     yr maxFlock   xmf   lat   long state effortDays effortHours
    <chr>  <chr>    <int> <dbl> <dbl>  <dbl> <chr>      <chr>       <chr>
 1  L4278 P_2000        3     3 41.42 -73.67    NY         II           C
 2  L4278 P_2000        6     5 41.42 -73.67    NY        III           C
 3  L4278 P_2000        4     5 41.42 -73.67    NY        III           C
 4  L4278 P_2012        2     2 41.42 -73.67    NY        III           B
 5  L4278 P_2012        4     6 41.42 -73.67    NY         IV           B
 6  L4278 P_2012        8     6 41.42 -73.67    NY         IV           B
 7 L10494 P_2003        4     2 42.01 -77.44    NY         IV           C
 8 L10494 P_2003        0     2 42.01 -77.44    NY         IV           C
 9 L10494 P_2003        8     8 42.01 -77.44    NY         IV           D
10 L10494 P_2005        4     5 42.01 -77.44    NY         IV           C
11 L10494 P_2005        6     5 42.01 -77.44    NY         IV           C
12 L10494 P_2009        8     8 42.01 -77.44    NY         IV           C

DATA

PData <- read.table(text = "  locID    yr  maxFlock     lat   long state effortDays effortHours 
L4278   P_2000        3   41.42 -73.67    NY        II           C
                 L4278   P_2000        6   41.42 -73.67    NY       III           C
                 L4278   P_2000        4   41.42 -73.67    NY       III           C
                 L4278   P_2012        2   41.42 -73.67    NY       III           B
                 L4278   P_2012        4   41.42 -73.67    NY        IV           B
                 L4278   P_2012        8   41.42 -73.67    NY        IV           B
                 L10494  P_2003        4   42.01 -77.44    NY        IV           C
                 L10494  P_2003        0   42.01 -77.44    NY        IV           C
                 L10494  P_2003        8   42.01 -77.44    NY        IV           D
                 L10494  P_2005        4   42.01 -77.44    NY        IV           C
                 L10494  P_2005        6   42.01 -77.44    NY        IV           C
                 L10494  P_2009        8   42.01 -77.44    NY        IV           C
                 ",
                 header = TRUE, stringsAsFactors = FALSE)
www
  • 38,575
  • 12
  • 48
  • 84
  • Thanks for this! It looks like what I want to do. Could you explain what `select(c(1:3, 9, 4:8))` is doing, though? Thank you!! – Heliornis Sep 23 '17 at 19:23
  • @Heliornis That is a way to specify the index number of columns. After the `mutate` call, column `xmf` is the last column of the data frame. I was then changing the position of `xml` as the same as your desired output. This step is optional if the position of `xml` is not important. – www Sep 23 '17 at 19:25
  • Oh wow, got it, thank you! – Heliornis Sep 23 '17 at 19:26
0

You can create a new column, which combines the four columns (locID, yr, effortDays, effortHours). Then tapply on with the new column as INDEX and then simply extract the values.

grouping <- paste(PData$locID,
                  PData$yr,
                  PData$effortDays,
                  PData$effortHours, sep = "_")
agg.vals <- tapply(PData$maxFlock, INDEX = grouping, FUN = mean)
PData["xmf"] <- agg.vals[grouping]
boyangeor
  • 381
  • 3
  • 6
0
df <- aggregate(PData$maxFlock, by = list(PData$locID, PData$yr, PData$effortDays, PData$effortHours), FUN = mean)
names(df) <- c("locID", "yr", "effortDays", "effortHours", "xmf")

df

    locID   yr   effortDays effortHours   xmf
1   L4278  P_2012     III       B         2
2   L4278  P_2012      IV       B         6
3   L4278  P_2000      II       C         3
4   L4278  P_2000     III       C         5
5  L10494  P_2003      IV       C         2
6  L10494  P_2005      IV       C         5
7  L10494  P_2009      IV       C         8
8  L10494  P_2003      IV       D         8
Santosh M.
  • 2,356
  • 1
  • 17
  • 29