Percentile in a data frame using two columns

Question

Perhaps it's an easy problem, but I'm stuck.

My data frame (which comes from a yearly survey) contains length data for several species by year and by haul. I want to obtain, for each year, the 95th percentile of length for each species. A sample of my data frame:

    structure(list(year = c(2015L, 2015L, 2015L, 2015L, 2014L, 2016L,
    2015L, 2016L, 2014L, 2016L, 2015L, 2015L, 2016L, 2016L, 2014L, 2014L,
    2014L, 2015L, 2016L, 2016L), cod_haul = structure(c(72L, 51L, 77L,
    43L, 20L, 92L, 75L, 93L, 9L, 103L, 65L, 63L, 85L, 102L, 27L, 24L,
    14L, 55L, 114L, 105L), .Label = c("N14_02", "N14_03", "N14_04",
    "N14_06", "N14_07", "N14_08", "N14_10", "N14_13", "N14_16", "N14_17",
    "N14_19", "N14_21", "N14_24", "N14_25", "N14_26", "N14_27", "N14_28",
    "N14_29", "N14_30", "N14_32", "N14_33", "N14_35", "N14_37", "N14_39",
    "N14_40", "N14_41", "N14_42", "N14_44", "N14_51", "N14_54", "N14_55",
    "N14_56", "N14_57", "N14_58", "N14_61", "N14_62", "N14_64", "N14_66",
    "N14_67", "N15_01", "N15_03", "N15_07", "N15_11", "N15_12", "N15_14",
    "N15_16", "N15_18", "N15_19", "N15_20", "N15_22", "N15_23", "N15_24",
    "N15_25", "N15_26", "N15_27", "N15_28", "N15_29", "N15_30", "N15_31",
    "N15_32", "N15_36", "N15_37", "N15_39", "N15_41", "N15_44", "N15_46",
    "N15_47", "N15_48", "N15_52", "N15_55", "N15_56", "N15_58", "N15_59",
    "N15_60", "N15_62", "N15_63", "N15_64", "N15_66", "N15_67", "N16_04",
    "N16_06", "N16_07", "N16_08", "N16_11", "N16_12", "N16_13", "N16_15",
    "N16_17", "N16_18", "N16_20", "N16_22", "N16_23", "N16_25", "N16_28",
    "N16_29", "N16_30", "N16_31", "N16_32", "N16_33", "N16_34", "N16_35",
    "N16_37", "N16_40", "N16_41", "N16_45", "N16_46", "N16_47", "N16_48",
    "N16_49", "N16_50", "N16_51", "N16_52", "N16_53", "N16_54", "N16_56",
    "N16_58", "N16_60", "N16_61", "N16_62", "N16_63", "N16_64","N16_66"),
     class = "factor"), haul = c(58L, 23L, 64L, 11L, 32L, 23L, 62L, 25L,
     16L, 40L, 44L, 39L, 12L, 37L, 42L, 39L, 25L, 27L, 54L, 45L), name =
     structure(c(2L, 23L, 11L, 2L, 19L, 15L, 18L, 16L, 3L, 21L, 16L, 21L,
     20L, 19L, 3L, 18L, 16L, 11L, 7L, 13L), .Label = c(
     "Argentina sphyraena", "Arnoglossus laterna", "Blennius ocellaris",
     "Boops boops", "Callionymus lyra", "Callionymus maculatus", "Capros aper",
     "Cepola macrophthalma", "Chelidonichthys cuculus",
     "Chelidonichthys lucerna", "Conger conger", "Eutrigla gurnardus",
     "Gadiculus argenteus", "Galeus melastomus", "Helicolenus dactylopterus",
     "Lepidorhombus boscii", "Lepidorhombus whiffiagonis",
     "Merluccius merluccius", "Microchirus variegatus", "Micromesistius poutassou",
      "Phycis blennoides", "Raja clavata", "Scyliorhinus canicula",
      "Solea solea", "Trachurus trachurus", "Trisopterus luscus"), class
      = "factor"), length = c(9L, 18L, 50L, 12L, 14L, 12L, 31L, 19L, 15L,
      16L, 26L, 48L, 23L, 10L, 16L, 24L, 12L, 46L, 75L, 13L), number =
      c(5L, 4L, 1L, 2L, 29L, 5L, 2L, 14L, 1L, 1L, 4L, 1L, 29L, 21L, 2L,
      1L, 2L, 1L, 2L, 14L)), row.names = c(NA, 20L), class = 
       "data.frame")

I have tried several approaches, but none of them worked.

Any suggestions or advice is much appreciated.

Thanks!

P.S.: Although it isn't absolutely necessary, it would be great if the percentile could be added to the data frame as a new column.

It would be great if you added an MRE (not a link to the whole data set but only as much code/data as is necessary to reproduce your question). More info on [minimal reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610#5963610) — dario, Feb 09 '21 at 14:04
To expand on dario's comment - you'll get help a lot faster if you put a small copy/pasteable sample of data into your question. 10-20 rows is plenty, and it's easy to do: `dput(your_data[1:10, ])` will give a copy/pasteable version of the top 10 rows of your data, and people won't have to download it, put it in their working directory, identify its format, read it in, correct any column classes, etc., all before starting to actually help with your problem. — Gregor Thomas, Feb 09 '21 at 14:27
@GregorThomas, I have edited my question following your advice. Thanks — Juan Carlos, Feb 10 '21 at 08:34
The beginning of your data structure seems to be missing. It should start with `structure(list...`, but what you show starts with `2015L, ...` — Gregor Thomas, Feb 10 '21 at 13:25

score 1 · Answer 1 · answered Feb 09 '21 at 15:39

    library(dplyr)

    # `species` stands in for the numeric column to summarise
    df %>%
      group_by(year) %>%
      summarize(species.95 = quantile(species, 0.95))

I cannot download your data frame, but you can use the `quantile` function to find the 95th percentile for each species.
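A minimal base R sketch of that idea, written with the question's column names (`year`, `name`, `length`) instead of the placeholder. The `toy` data frame and its values are invented for illustration, and the sketch treats every row as a single observation, so it ignores the `number` column.

```r
# Toy data with the question's column names (values invented for illustration)
toy <- data.frame(
  year   = c(2015L, 2015L, 2015L, 2016L, 2016L),
  name   = "Solea solea",
  length = c(10L, 20L, 30L, 10L, 20L)
)

# One row per year/species pair; extra arguments such as probs are passed to quantile()
q95 <- aggregate(length ~ year + name, data = toy, FUN = quantile, probs = 0.95)

# Rename the summarised column so it is not mistaken for a raw length
names(q95)[names(q95) == "length"] <- "q95"
q95
```

This returns a summary table with one row per group rather than adding a column to the original data.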

shinyy

Thanks @shinyy. It wasn't exactly what I was looking for, but your answer has helped a lot. – Juan Carlos Feb 10 '21 at 09:24

score 1 · Answer 2 · answered Feb 09 '21 at 16:13

If I understand you correctly:

    library(tidyverse)

    # Add the 95th percentile of length per year and species as a new column
    df %>%
      group_by(year, name) %>%
      mutate(q95 = quantile(length, probs = 0.95))

or

    library(data.table)

    # Same computation, added by reference; the result is shown ordered by species and year
    setDT(df)
    df[, q95 := quantile(length, probs = 0.95), by = list(year, name)][order(name, year)]

Yuriy Saraykin

Thanks @Yuriy Saraykin for your approach. Perhaps I have misunderstood the concept of a percentile, but shouldn't the number of individuals with the same length ("number") be part of the script somehow? – Juan Carlos Feb 10 '21 at 10:49
I don't really understand what the question is about. Can you give an example? – Yuriy Saraykin Feb 10 '21 at 11:04
As I commented before, it's surely my fault for not knowing exactly how percentiles are calculated. I suppose the value of the percentile differs between, for example, the size distribution 10, 11, 12, 13, 14, 15 and the size distribution 10, 10, 11, 11, 12, 12, 13, 13, 14, 14, 15, 15. That is, does the number of individuals with a given length play any role in calculating the percentile? This doubt is why I asked about using the number of specimens per length in the calculation. – Juan Carlos Feb 10 '21 at 12:27
Yes, it does. The nuances of calculating quantiles can be found in the function help `?quantile` – Yuriy Saraykin Feb 10 '21 at 12:35
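
The two size distributions from the comment above can be checked directly, and the same idea extends to the `number` column. The following is only a sketch in base R: it assumes `number` is the count of individuals measured at that length in that row, and it accounts for the counts by repeating each length `number` times before calling `quantile`. The helper `weighted_q95` and the `toy` data frame (with invented values) are hypothetical names introduced for illustration.

```r
# The two size distributions from the comment
once  <- c(10, 11, 12, 13, 14, 15)
twice <- rep(once, each = 2)                   # every length observed twice

quantile(once,  probs = 0.95, names = FALSE)   # 14.75 with the default type 7
quantile(twice, probs = 0.95, names = FALSE)   # 15

# Hypothetical helper: repeat each length by its count, then take the quantile
weighted_q95 <- function(len, n) {
  quantile(rep(len, times = n), probs = 0.95, names = FALSE)
}

# Toy data in the same shape as the question (values invented for illustration)
toy <- data.frame(
  year   = c(2015L, 2015L, 2015L, 2016L, 2016L),
  name   = "Solea solea",
  length = c(10L, 20L, 30L, 10L, 30L),
  number = c(18L, 1L, 1L, 5L, 5L)
)

# Split by year and species, add the weighted percentile to each piece, recombine
parts <- split(toy, list(toy$year, toy$name), drop = TRUE)
toy <- do.call(rbind, lapply(parts, function(g) {
  g$q95 <- weighted_q95(g$length, g$number)
  g
}))
toy
```

In the grouped `mutate` shown in the answer, the equivalent change would be to call `quantile(rep(length, times = number), probs = 0.95)` instead of `quantile(length, probs = 0.95)`. Expanding the counts is simple but can use a lot of memory when `number` is large.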
