1

I want to find out if certain genes cluster together. Now, i already have a list of genes, and also their start and stop locations, and i already know how to calculate the distance between these genes. the problem is that i don't know how to take into account the switch in chromosomes.

You can't measure the distance between a gene on chromosome 1, and a gene on chromosome 2.

I thought of calculating the distance like this: start location of gene 2 - stop location of gene 1. Then, you have the distance between these genes.

But how do i account for this: when you reach the next chromosome, the R code grabs the start location of a gene on chromosome 2, but the stop location of a gene on chromosome 1, and this is not possible (for my research, at least).

So i am wondering how to account for that in R. I just need to somehow skip the genes if they are on different chromosomes.

I hope you guys can help me.

about the code below: the three vectors are just the vectors of start en stop locations, and the chromosomes. they are all of equal length. chromosomes is a vector containing the chromosome number for every gene

start_vector <- as.vector(sorted_coords$start_position)
end_vector <- as.vector(sorted_coords$end_position)
chromosomes <- as.vector(sorted_coords$chromosome_name)

chromosomes[is.na(chromosomes)] <- 24

count = 0
for(i in 1:length(chromosomes)){
  if(count != chromosomes[i]){
    start <- i - 1
    end <- i + 1

    start_vector <- start_vector[-start]
    end_vector <- end_vector[-end]
    count <- count + 1  
    }

}

I expect a vector of distances of all the genes, excluding the distances of genes which lie on different chromosomes.

Rene Bults
  • 135
  • 1
  • 4
  • 1
    Could you provide a reproducible example we can work with? This is a great post on how: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – Sahir Moosvi Jan 09 '19 at 18:14
  • 1
    Yes, you have to iterate over the chromosomes using for loop (subset genes from specific chromosome and apply wanted function). Do you have a more specific question as this is not very related to R? You can also calculate gene density per chromosome (for me this seems more natural) – pogibas Jan 09 '19 at 19:01

1 Answers1

2
library(tidyverse) # for all the tidyverse goodies
library(reshape2) # For the melt function

As you did not provide a reproducible example, I took the liberty of making my own toy dataframe as below. It has only 2 chromosomes but this method should be applicable to an arbitrary number of chromosomes and genes.

sorted_coords <- tibble(start_position = abs(rnorm(10)*10),
                        end_position = abs(rnorm(10)*10),
                        chromosome_name = c(rep(1,5),rep(2,5)))

Edit: OP clarified they wanted to find the distance just to gene before not to every other gene. The method to do the latter part is at the bottom because I found it interesting. The new solution is here:

sorted_coords %>% 
  group_by(chromosome_name) %>%  
  arrange(chromosome_name, start_position) %>% 
  mutate(distance = start_position - lag(end_position, n = 1, default = 0))
  1. We group by chromosome so that we don't do any erroneous calculations between chromosomes.

  2. We arrange by chromosome name for sorting purposes at the end. We arrange by starting position so the genes are in the right order.

  3. We calculate the distance as proposed. The current row's start position - the previous row's end position. We specify (though it is the default) that we look at the row before, and that if there is no row before the value of the end position defaults to 0.


Old answer

If you want to compare every gene to every other gene, the fastest way to do this would be to create a matrix. As you specified, we want to subtract the start of gene 1 to the end of gene 2. That doesn't feel right to me but it's been a while since I did biochem :). Because you want a single list of pairs we can collapse it (melt function).

The code below is a little hazy to understand so let's break it down.

sorted_coords %>% 
  group_by(chromosome_name) %>% 
  do( outer(.$start_position, .$end_position) %>% 
        melt() %>% 
        setNames(c("rows", "columns", "distance")))
  1. We take our data frame and group it by each chromosome as you need.
  2. The do command lets us do complicated operations. The group_by command ensures that whatever we do is essentially isolated for each chromosome.
  3. The outer function creates our matrix for us. the . is the dataframe we pass (subsetted to a specific chromosome). We pass the two columns we need to find the difference for.
  4. The melt function turns the matrix into a dataframe for us specifying which two genes it used to calculate the difference. They are numerically listed and you can go back and compare it. I would suggest using arrange to set the order so you can reference it back easily.
  5. setNames just sets the column names to something more redable.

This should be much faster than running a for loop for all your processes. If you provide more info I can probable clean up the answer more.

Sahir Moosvi
  • 549
  • 2
  • 21
  • You may wish to add some filters such as filtering out rows where the row and column values are the same as it would be measuring the distance of a gene to itself. Or a filter to remove rows where the row value > column value. This is because this method does the distance of A to B as well as the distance of B to A (which won't be the same because we are using start and end positions). This however may be what you want. – Sahir Moosvi Jan 09 '19 at 20:24
  • Hey Sahir, Thanks for answering. i only think that there are a bit too many distances now. what i need is just the distance between two genes. have a look at this: https://ibb.co/QMbNVJB thats my dataframe of the locations i have. you created something similar, but still i think there are too many distances for the genes that i have. also, i dont need to calculate the distance between every gene, but just between 1 and 2, 2 and 3, 3 and 4, 4 and 5, etc. Allthough in advance, thank you very much – Rene Bults Jan 10 '19 at 09:38
  • Oh, if you just want to compare to the gene directly before then you can just use the lead/lag function in in dplyr. I'll update the answer when i get a chance for that. – Sahir Moosvi Jan 10 '19 at 17:41
  • You're welcome, updated the answer. Is this more what you were looking for? – Sahir Moosvi Jan 10 '19 at 18:04
  • @ReneBults Please paste your example data as text into your question, not images. See [here](http://stackoverflow.com/questions/5963269). – zx8754 Jan 11 '19 at 06:53
  • Sorry for the late reply, but yes, this is exactly what i needed. Thank you all for helping me and i am still sorry for not giving proper examples. Cheers! – Rene Bults Jan 21 '19 at 16:14