2

I am trying to count the number of paralogues for the mouse homologues of the human protein-coding genes using BioMart. But for example in the 'PLIN4' gene its counting 35,000 paralogues instead of 4.

We think it is because some genes have one to many paralogues which causes repeats. When I run a single gene its gives me back the correct number of paralogues. Is there a way to either remove these repeats from the results or a way around this so that BioMart doesn't output these repeats.

I have also thought of maybe running one gene at a time, then counting it by setting up some sort of loop so that it does all of the genes from the list automatically.

The code I have written so far is:

# Load the biomaRt package:

library(biomaRt)
ensembl_hsapiens <- useMart("ensembl", 
                          dataset = "hsapiens_gene_ensembl")
ensembl_mouse <- useMart("ensembl", 
                       dataset = "mmusculus_gene_ensembl")

# Get all human protein coding genes:

hsapien_PC_genes <- getBM(attributes = c("ensembl_gene_id", 
                                         "external_gene_name"), 
                          filters = "biotype", 
                          values = "protein_coding", 
                          mart = ensembl_hsapiens)


ensembl_gene_ID <- hsapien_PC_genes$ensembl_gene_id

# Get mouse homologues

mouse_homologues <- getBM(attributes = c("ensembl_gene_id", "external_gene_name", 
                                       "mmusculus_homolog_associated_gene_name"), 
                        filters = "ensembl_gene_id", 
                        values = c(ensembl_gene_ID), 
                        mart = ensembl_hsapiens)

# Get mouse external gene name 

mouse_homologues_external_gene_names <- mouse_homologues$mmusculus_homolog_associated_gene_name


mouse_paralogues <- getBM(attributes = c("hsapiens_homolog_associated_gene_name",
                                       "external_gene_name",
                                       "mmusculus_paralog_associated_gene_name"), 
                        filters = "external_gene_name", 
                        values = c(mouse_homologues_external_gene_names) , mart = ensembl_mouse)

# Remove genes with no paralogues 
mouse_paralogs_data <- mouse_paralogues[!(is.na(mouse_paralogues$mmusculus_paralog_associated_gene_name)
                                          | 
mouse_paralogues$mmusculus_paralog_associated_gene_name==""), ]

# Count paralogues per gene

library(plyr)
count_mouse_paralogues <- count(mouse_paralogs_data, "external_gene_name")
View(count_mouse_paralogues)

Hope someone can help

Thanks

Jack

Jack Dean
  • 163
  • 1
  • 7
  • I'm voting to close this question as off-topic because it seems like a better fit for https://bioinformatics.stackexchange.com/ – Stedy Jan 27 '18 at 05:37
  • Your code works fine for me, and it is giving 4 paralogues for gene *PLIN4*. Please check your code again, and clarify why it is not working as expected, also provide output of `sessionInfo()` (R version, biomar version). – zx8754 Jan 28 '18 at 20:23
  • Possible duplicate of https://stackoverflow.com/questions/12840294/counting-unique-distinct-values-by-group-in-a-data-frame – zx8754 Jan 28 '18 at 20:29

1 Answers1

1

In addition to the comment, here is what I am getting:

# R version 3.4.2 (2017-09-28)
library(biomaRt) # biomaRt_2.32.1
library(dplyr)   # dplyr_0.7.4

# test how many paralogues for gene PLIN4
nrow(mouse_paralogs_data[
  mouse_paralogs_data$hsapiens_homolog_associated_gene_name == "PLIN4", ])
# [1] 4

# now summarise for all genes
res <- mouse_paralogs_data %>% 
  group_by(hsapiens_homolog_associated_gene_name) %>% 
  summarise(DistinctP = n_distinct(mmusculus_paralog_associated_gene_name))

# test again number of paralogues for gene PLIN4
res[ res$hsapiens_homolog_associated_gene_name == "PLIN4", ]
# # A tibble: 1 x 2
#   hsapiens_homolog_associated_gene_name DistinctP
#   <chr>                                     <int>
# 1 PLIN4                                         4
zx8754
  • 52,746
  • 12
  • 114
  • 209