Good Morning Everyone,

I have a small issue with a data frame:

I have 165 different countries, some of which occur more than 30 times. What I would like to do is keep only 30 occurrences for each country, and then apply the mean function to the related variables.

Do you have any idea how I can achieve this?

Here is the data frame:

Dataframe

Thanks for your answer,

Rémi

Rémi Pts
    Please post a [reproducible](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) example, rather than links to images which we cannot run in our own session – Conor Neilson Jan 04 '19 at 10:38

1 Answer

Assuming you want to draw 30 rows at random for each group, we can do the following. Unfortunately, dplyr's sample_n fails when a group has fewer rows than you want to sample (unless you sample with replacement).

Where df is your data.frame:

Solution 1:

library(dplyr)
df %>% group_by(Nationality) %>%
  sample_n(30, replace = TRUE) %>%
  # drop rows repeated by sampling with replacement (e.g. countries with fewer than 30 rows);
  # note that larger countries can also end up with fewer than 30 distinct rows
  distinct() %>%
  summarise_at(vars(Age, Overall, Passing), mean)

Solution 2:

df %>% split(.$Nationality) %>%
  lapply(function(x) {
    # keep small groups whole; they cannot be sampled without replacement
    if (nrow(x) < 30)
      return(x)
    x %>% sample_n(30, replace = FALSE)
  }) %>%
  do.call(what = bind_rows) %>% # stack the per-country pieces back together
  group_by(Nationality) %>%
  summarise_at(vars(Age, Overall, Passing), mean)

Naturally, this comes without guarantee, as you did not supply a working example.
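
For reference, here is a minimal sketch of the same idea that can be run as is. It assumes dplyr 1.0.0 or later, where slice_sample() on a grouped data frame silently truncates n to the group size, so countries with fewer than 30 rows are kept whole. The toy data frame is hypothetical (country "A" with 40 rows, country "B" with 5); only the column names come from the code above.

```r
library(dplyr)

# Hypothetical toy data: country "A" has 40 rows, country "B" only 5.
set.seed(1)
df <- data.frame(
  Nationality = rep(c("A", "B"), times = c(40, 5)),
  Age         = sample(18:40, 45, replace = TRUE),
  Overall     = sample(50:90, 45, replace = TRUE),
  Passing     = sample(30:90, 45, replace = TRUE)
)

# Draw at most 30 rows per country, without replacement.
sampled <- df %>%
  group_by(Nationality) %>%
  slice_sample(n = 30) %>%
  ungroup()

# Average the variables of interest per country.
result <- sampled %>%
  group_by(Nationality) %>%
  summarise(across(c(Age, Overall, Passing), mean))

result
```

If "take only 30" means the first 30 rows rather than a random draw, slice_head(n = 30) could be swapped in for slice_sample(n = 30) under the same assumption about the dplyr version.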

MrGumble