0

Questions TL;DR

  1. How can I change my get_top_genes function so it will accept 3 arguments: a dataframe, the desired column, and the number of rows returned for each sample_name?
  2. (Bonus question) Why do top_n(), slice_head() & slice_max() not return the desired output? (2 rows per group)?

Introduction

I have a very large dataframe of the following structure:

Rows: 0
Columns: 7
$ transcript         <chr> 
$ group              <fct> 
$ sample_name        <fct> 
$ transcript_biotype <chr> 
$ description        <chr> 
$ counts_total       <int> 
$ counts_unmod       <int> 

For each "sample_name", I want to extract certain amount of "transcripts" with highest values in "counts_total". Let's say, for example, that I want to extract 100 "transcripts" per each "sample_name".

For this purpose, I wrote a small function:

get_top_genes <- function(df,column, numb){
  
  # remove rows with NAs
  df <- df %>% tidyr::drop_na()
  
  #arrange & split by group
  group_genes <- df %>% dplyr::group_by(group) %>% 
    dplyr::arrange(desc(column)) %>% dplyr::top_n(numb) %>% 
    dplyr::group_split(group)
  
  #extract names
  names(group_genes) <- lapply(group_genes, function(x) as.character(x$group[1]))
  
  #output <- lapply(group_genes, function(x) x$external_gene_name)
  
  return(group_genes)
}

This function takes as arguments the bespoke dataframe (df), colname of the column of interest (column), and the number of values to be returned for each sample_name (numb).

This is an example function call:

get_top_genes(df, "counts_total", 100)

Unfortunately, this function does not work correctly, since it was supposed to return exactly 100 top values in the selected column for each sample, but it does not. In some cases there are less values than expected. The dataframe is large, it contains entire transcriptomes of multicellular species, there are thousands of transcripts per each sample_name, so there is no possibility that there are not enough transcripts in some samples.

If I replace the code after the "#arrange & split by group" comment with following snippet:

group_genes <- df %>% dplyr::group_by(sample_name) %>% dplyr::arrange(desc("counts_total")) %>% dplyr::slice_head(n=2) %>% dplyr::group_split(sample_name)

Where slice_head() is used instead of top_n(), I obtain correct results (2 rows per group). I do not know why is that be. I also would like my function to accept number of returned values as an argument, not have it hardcoded, as I did in case of slice_head() version of my function.

I also tried slice_max() approach, but it also did not provide satisfactory results (there were unequal amounts of rows per sample_name).

similar question I found this thread: Select the top N values by group and I decided to stick to the dplyr solutions, since I am most familiar with this package. However, this thread does not answer why those functions do not behave as expected.

Bonus question: why top_n(), slice_head() & slice_max() do not return desired output?

Dummy data

Since I can't dput() this large dataset, I am providing code to generate the dummy input:

set.seed(0)

num_transcripts_per_group <- 100
num_sample_names_per_group <- 3
transcript_shared_percentage <- 0.5

common_transcripts <- paste0("Transcript_", 1:num_transcripts_per_group)

unique_transcripts <- list(
  group1 = lapply(1:num_sample_names_per_group, function(i) {
    setdiff(common_transcripts, sample(common_transcripts, round(transcript_shared_percentage * num_transcripts_per_group)))
  }),
  group2 = lapply(1:num_sample_names_per_group, function(i) {
    setdiff(common_transcripts, sample(common_transcripts, round(transcript_shared_percentage * num_transcripts_per_group)))
  })
)

df <- expand.grid(
  transcript = c(unlist(unique_transcripts$group1), unlist(unique_transcripts$group2)),
  group = factor(rep(c("group1", "group2"), each = num_sample_names_per_group)),
  sample_name = factor(rep(paste0("Sample_", 1:num_sample_names_per_group), 2)),
  transcript_biotype = rep("biotype", 2 * num_sample_names_per_group),
  description = rep("Transcript description", 2 * num_sample_names_per_group)
)

unique_counts_total <- sample(1:1000, num_transcripts_per_group, replace = TRUE)
unique_counts_unmod <- sample(5:500, num_transcripts_per_group, replace = TRUE)

df$counts_total <- sample(unique_counts_total, nrow(df), replace = TRUE)
df$counts_unmod <- sample(unique_counts_unmod, nrow(df), replace = TRUE)

rownames(df) <- NULL

df <- df %>% mutate(sample_name = ifelse(group == "group2", paste0("sample", 4:6), as.character(sample_name)))

print(df)
Mark
  • 7,785
  • 2
  • 14
  • 34
ramen
  • 691
  • 4
  • 20
  • I pulled out the tl;dr to the place where people will read it. I still think this could do with editing down a lot – Mark Jul 22 '23 at 12:31
  • also what is the purpose of the two groups in the output of `get_top_genes(df, "counts_total", 100)`? – Mark Jul 22 '23 at 12:37
  • @Mark it is required for the downstream application. – ramen Jul 26 '23 at 07:05

1 Answers1

1

First of all, you should try to make a minimal, reproducible example. Your question is about dplyr functions and so you should use a standard dataset to illustrate what you're trying to do. There is no need to include anything about genes.

It seems like you're encountering two problems. The first is that dplyr::top_n() is confusing. That's why it was superseded in favor of slice_*** functions. When you use slice_head, make sure you name the n parameter. Your second problem seems to be about passing the column name into the function.

library(tidyverse)

get_top <- function(df, column, numb){
    df %>% 
        group_by(model) %>% 
        arrange(desc({{column}})) %>% 
        slice_head(n = numb) %>% 
        group_split()
}

mpg %>% get_top(cty, 2)
mpg %>% get_top(displ, 3)

Here I am passing the column name without quotation marks. If you need to pass a string then you can use

get_top2 <- function(df, column, numb){
    df %>% 
        group_by(model) %>% 
        arrange(desc(.[[column]])) %>% 
        slice_head(n = numb) %>% 
        group_split()
}

mpg %>% get_top2("cty", 2)
mpg %>% get_top2("displ", 3)

Finally, the reason slice_max({{column}}, n=100) doesn't give what you want is that it looks for the 100 largest values of your chosen column, and then takes all rows with those top values. So if you have any ties then you will receive more than 100 rows. Using slice_head(n=100) gives you the top 100 rows using whatever sorting was in place before you pipe into it.

It's probably best to just forget about top_n. But since you're curious, the reason it doesn't work is because it ignores your arrange and sorts the data on its own, using the third parameter wt. Since you didn't specify the wt parameter, it defaulted to using the last column in the data frame.

In your specific case, you should use:

get_top_genes <- function(df,column, numb){
    
    # remove rows with NAs
    df <- df %>% tidyr::drop_na()
    
    #arrange & split by group
    group_genes <- df %>% dplyr::group_by(group) %>% 
        dplyr::arrange(desc({{column}})) %>% dplyr::slice_head(n = numb) %>% 
        dplyr::group_split(group)
    
    #extract names
    names(group_genes) <- lapply(group_genes, function(x) as.character(x$group[1]))
    
    #output <- lapply(group_genes, function(x) x$external_gene_name)
    
    return(group_genes)
}
Michael Dewar
  • 2,553
  • 1
  • 6
  • 22
  • I provided a code to produce dummy in my question. I tried to use some standard data, but with them I was unable to reproduce the error. The R built-in datasets behaved exactly as supposed to, which is why I provided something similar to the actual input I am dealing with. I know the error occurs, since when I group split my data and process each sub-dataframe separately, I receive result consistent with expectations. On the other hand I would like to be able to do so without the wall of code (in real life I have large datasets, which would put heavy load on memory if group split). – ramen Jul 26 '23 at 07:09
  • Did you try my answer? I think it explains exactly where your problem is. I updated it to show exactly what `get_top_genes()` should be. – Michael Dewar Jul 26 '23 at 08:55