I have a large data set with 10 columns and 100,000 rows. One column is the date of observation, and two other columns give the corresponding species and year. First, I want to create a new column that gives the mean date of observation for the first 10% of observations, calculated separately for each species in each year. Second, I want to reduce the data set so that only the rows used in that calculation (i.e. the first 10% for each species in each year) remain. Finally, it's important that the new data set keeps the other columns with the information for each observation, e.g. the location. Here is a sample of the data set (more columns exist):
date=c(3,84,98,100,34,76,86...)
species=c("blue","purple","grey","purple","green","pink","pink","white"...)
id=c(1,2,3,2,4,5,5,6...)
year=c(1901,2000,1901,1996,1901,2000,1986...)
habitat=c("forest","plain","mountain"...)
For example, the first row says that species blue was seen on January 3rd, 1901 in a forest (date is the day of the year).
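To make the three steps concrete, here is a minimal base-R sketch of one possible reading of them. It assumes "first 10%" means the earliest 10% of dates within each species-year group, rounded up so that every group keeps at least one row. The toy data frame and the names `df`, `early`, `n_keep` and `mean_date` are hypothetical and invented for illustration; they are not the real data.

```r
# Hypothetical toy data (invented values): 20 "blue" rows in 1901, 3 "pink" rows in 2000.
df <- data.frame(
  date    = c(3, 84, 98, 100, 34, 76, 86, 12, 40, 55,
              61, 70, 15, 22, 48, 90, 110, 120, 65, 7,
              50, 60, 70),
  species = c(rep("blue", 20), rep("pink", 3)),
  year    = c(rep(1901, 20), rep(2000, 3)),
  habitat = rep(c("forest", "plain", "mountain"), length.out = 23),
  stringsAsFactors = FALSE
)

# Rank each observation by date within its species-year group (earliest = 1).
# Assumption: ties are broken by row order.
rank_in_group <- ave(df$date, df$species, df$year,
                     FUN = function(x) rank(x, ties.method = "first"))

# Number of rows to keep per group: 10% of the group size, rounded up.
group_size <- ave(df$date, df$species, df$year, FUN = length)
n_keep <- ceiling(0.10 * group_size)

# Keep only the earliest 10% of rows in each group; all other columns come along.
early <- df[rank_in_group <= n_keep, ]

# Add the mean date of those kept rows, again per species and year.
early$mean_date <- ave(early$date, early$species, early$year, FUN = mean)

print(early)
```

Because the rows are filtered rather than aggregated, columns such as `habitat` stay attached to each remaining observation. If "first 10%" should instead mean the first rows in file order, the ranking step would need to use the row index rather than `date`.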