-1

So I have a large data set, with many columns(10) and 100,000 rows. One of the columns is the date of observation with two other corresponding columns, one species and the other year. First, I want to create a new column that will give me the mean date of observation for each species for each year for the first 10% of the observations( for each species for each year). Second, I want to reduce that data set so that only rows involved in the calculation (ie: the first 10%) remain. Finally, it's important that my new data set has the other corresponding columns with information for each observation ie, the location ect. Sample of the data set (there do exist more columns):

date=c(3,84,98,100,34,76,86...)
species=c(blue,purple,grey,purple,green,pink,pink,white...)
id=c(1,2,3,2,4,5,5,6...)
year=c(1901,2000,1901,1996,1901,2000,1986...)  
habitat=c(forest,plain,mountain...)

Ex: the first row says species blue was seen on jan 3rd 1901 in a forest.

John
  • 59
  • 1
  • 9
  • 1
    data.table and dplyr are both good packages for this kind of manipulation. Please also see http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – r.bot Apr 04 '15 at 19:02
  • I'm really horrid with R, any idea how I'd use dplyr specifically? – John Apr 04 '15 at 21:02

1 Answers1

0

Ok, here's one approach using dplyr. This will get you the mean of a variable, by species and year, using the first 10% of observation for each grouping.

require(dplyr)

# test data set
test <- data.frame(species = c(rep("blue", 100), rep("purple",100)), 
                   year = rep(c(1901, 1902, 1903, 1904, 1905), 40),
                   value = rnorm(200),
                   stringsAsFactors = FALSE)

# checking data set
group_by(test, species, year) %>% summarise(n = n(), mean.value = mean(value))

# by species and year, identify first ten per cent of observations
test <- test %>% group_by(species, year) %>%
  mutate(nth.ob = seq_along(species), n.obs = n(), pc = round((nth.ob/n.obs*100), 2) ) %>% 
  arrange(species, year) # sort for easy viewing

# and check
head(test)
Source: local data frame [6 x 6]
Groups: species, year

  species year      value nth.ob n.obs pc
1    blue 1901 -0.2839094      1    20  5
2    blue 1901 -1.7158035      2    20 10
3    blue 1901  1.1664650      3    20 15
4    blue 1901 -0.0935940      4    20 20
5    blue 1901 -0.1199253      5    20 25
6    blue 1901  0.3461677      6    20 30

# reduce to top 10 %, summarise and drop unwanted variables
out <- test %>%
  filter(pc <= 10) %>% # select first 10% of observations by species and year
  summarise(mean_val = mean(value))
out


Source: local data frame [10 x 3]
Groups: species

   species year    mean_val
1     blue 1901 -0.99985643
2     blue 1902  0.08355729
3     blue 1903  0.67396796
4     blue 1904  0.14425229
5     blue 1905 -0.19426698
6   purple 1901  0.95767665
7   purple 1902 -0.40730494
8   purple 1903  0.10032964
9   purple 1904  0.36295224
10  purple 1905  1.30953008

If you then want the settings in which the first observation was detected, I think the best way to do that would be to do something like

setting <- group_by(species, year) %>% 
    filter(row_number() == 1)

and then join the data to the out data set

r.bot
  • 5,309
  • 1
  • 34
  • 45