Questions TL;DR
- How can I change my
get_top_genes
function so it will accept 3 arguments: a dataframe, the desired column, and the number of rows returned for eachsample_name
? - (Bonus question) Why do
top_n()
,slice_head()
&slice_max()
not return the desired output? (2 rows per group)?
Introduction
I have a very large dataframe of the following structure:
Rows: 0
Columns: 7
$ transcript <chr>
$ group <fct>
$ sample_name <fct>
$ transcript_biotype <chr>
$ description <chr>
$ counts_total <int>
$ counts_unmod <int>
For each "sample_name", I want to extract certain amount of "transcripts" with highest values in "counts_total". Let's say, for example, that I want to extract 100 "transcripts" per each "sample_name".
For this purpose, I wrote a small function:
get_top_genes <- function(df,column, numb){
# remove rows with NAs
df <- df %>% tidyr::drop_na()
#arrange & split by group
group_genes <- df %>% dplyr::group_by(group) %>%
dplyr::arrange(desc(column)) %>% dplyr::top_n(numb) %>%
dplyr::group_split(group)
#extract names
names(group_genes) <- lapply(group_genes, function(x) as.character(x$group[1]))
#output <- lapply(group_genes, function(x) x$external_gene_name)
return(group_genes)
}
This function takes as arguments the bespoke dataframe (df), colname of the column of interest (column), and the number of values to be returned for each sample_name (numb).
This is an example function call:
get_top_genes(df, "counts_total", 100)
Unfortunately, this function does not work correctly, since it was supposed to return exactly 100 top values in the selected column for each sample, but it does not. In some cases there are less values than expected. The dataframe is large, it contains entire transcriptomes of multicellular species, there are thousands of transcripts per each sample_name, so there is no possibility that there are not enough transcripts in some samples.
If I replace the code after the "#arrange & split by group" comment with following snippet:
group_genes <- df %>% dplyr::group_by(sample_name) %>% dplyr::arrange(desc("counts_total")) %>% dplyr::slice_head(n=2) %>% dplyr::group_split(sample_name)
Where slice_head() is used instead of top_n(), I obtain correct results (2 rows per group). I do not know why is that be. I also would like my function to accept number of returned values as an argument, not have it hardcoded, as I did in case of slice_head() version of my function.
I also tried slice_max() approach, but it also did not provide satisfactory results (there were unequal amounts of rows per sample_name).
similar question I found this thread: Select the top N values by group and I decided to stick to the dplyr solutions, since I am most familiar with this package. However, this thread does not answer why those functions do not behave as expected.
Bonus question: why top_n(), slice_head() & slice_max() do not return desired output?
Dummy data
Since I can't dput() this large dataset, I am providing code to generate the dummy input:
set.seed(0)
num_transcripts_per_group <- 100
num_sample_names_per_group <- 3
transcript_shared_percentage <- 0.5
common_transcripts <- paste0("Transcript_", 1:num_transcripts_per_group)
unique_transcripts <- list(
group1 = lapply(1:num_sample_names_per_group, function(i) {
setdiff(common_transcripts, sample(common_transcripts, round(transcript_shared_percentage * num_transcripts_per_group)))
}),
group2 = lapply(1:num_sample_names_per_group, function(i) {
setdiff(common_transcripts, sample(common_transcripts, round(transcript_shared_percentage * num_transcripts_per_group)))
})
)
df <- expand.grid(
transcript = c(unlist(unique_transcripts$group1), unlist(unique_transcripts$group2)),
group = factor(rep(c("group1", "group2"), each = num_sample_names_per_group)),
sample_name = factor(rep(paste0("Sample_", 1:num_sample_names_per_group), 2)),
transcript_biotype = rep("biotype", 2 * num_sample_names_per_group),
description = rep("Transcript description", 2 * num_sample_names_per_group)
)
unique_counts_total <- sample(1:1000, num_transcripts_per_group, replace = TRUE)
unique_counts_unmod <- sample(5:500, num_transcripts_per_group, replace = TRUE)
df$counts_total <- sample(unique_counts_total, nrow(df), replace = TRUE)
df$counts_unmod <- sample(unique_counts_unmod, nrow(df), replace = TRUE)
rownames(df) <- NULL
df <- df %>% mutate(sample_name = ifelse(group == "group2", paste0("sample", 4:6), as.character(sample_name)))
print(df)