0

I have table like this and would like to count genes that appear the most (lets say top 10 genes) and then find out mean of tail_len for those top 10 genes.

gene tail_len
1 SPAC20G4.06c 3
2 SPCC613.06 5
3 SPAC6F6.03c 2
4 SPAC20G4.06c 3
5 SPBC23G7.15c 5
6 SPAC589.10c 2
7 SPBC23G7.15c 3
8 SPAC22H12.04c 1
9 SPAC22H12.04c 12
10 SPAC6G10.11c 8
11 SPAC589.10c 31
12 SPBC18E5.06 16
bobyca
  • 3
  • 2

2 Answers2

0

It's tough to test without a larger dataset. But this would be an approach using dplyr:

library(tidyverse)

foo <- tibble(
  gene = c("SPAC20G4.06c", "SPCC613.06", "SPAC20G4.06c", "SPAC6F6.03c", "SPBC23G7.15c"), 
  tail_len = c(3, 5, 1, 6, 7)
)

foo_top10 <- foo %>%
  group_by(gene) %>%
  summarize(count = n()) %>%
  slice_max(count, n = 10)

foo %>% 
  filter(gene %in% foo_top10$gene) %>%
  group_by(gene) %>%
  summarize(tail_len_mean = mean(tail_len))
Harrison Jones
  • 2,256
  • 5
  • 27
  • 34
  • The original dataset has over 7 million rows lol. I managed to solve this at the end with aggregate. – bobyca Jan 20 '23 at 21:58
0

Here is a way with slice_max. I have defined two variables, ties_ok and max_n. The latter is set to 3 to test the code, you want max_n <- 110, the former can bee set to FALSE if you want to discard ties and keep only the first rows found.

df1 <- "    gene    tail_len
1   SPAC20G4.06c    3
2   SPCC613.06  5
3   SPAC6F6.03c     2
4   SPAC20G4.06c    3
5   SPBC23G7.15c    5
6   SPAC589.10c     2
7   SPBC23G7.15c    3
8   SPAC22H12.04c   1
9   SPAC22H12.04c   12
10  SPAC6G10.11c    8
11  SPAC589.10c     31
12  SPBC18E5.06     16"
df1 <- read.table(text = df1, header = TRUE)

suppressPackageStartupMessages(
  library(dplyr)
)

ties_ok <- TRUE
#ties_ok <- FALSE
max_n <- 3L
df1 %>%
  group_by(gene) %>%
  summarise(count = n(), mean_tail_len = mean(tail_len)) %>%
  slice_max(count, n = max_n, with_ties = ties_ok) %>%
  select(-count)
#> # A tibble: 4 × 2
#>   gene          mean_tail_len
#>   <chr>                 <dbl>
#> 1 SPAC20G4.06c            3  
#> 2 SPAC22H12.04c           6.5
#> 3 SPAC589.10c            16.5
#> 4 SPBC23G7.15c            4

Created on 2023-01-20 with reprex v2.0.2

Rui Barradas
  • 70,273
  • 8
  • 34
  • 66