I have a genetic dataset of chromosome positions, and I am trying to use the positions to find gene lengths. For example:

#Input:
Chr  Start  End  Genes
1    1      2    Gene1
1    3      4    Gene1
1    5      9    Gene2
2    1      3    Gene3


#Expected output calculating gene lengths:
Chr  Start  End  Genes  Length
1    1      2    Gene1  3
1    3      4    Gene1  3
1    5      9    Gene2  4
2    1      3    Gene3  2

So I am looking to find for each gene the maximum End value minus the minimum Start value and put that value in a new Length column.

I've been trying to go about this with something like:

test <- df %>%
  group_by(Genes) %>%
  df$Length = (min(df$Start) - max(df$End))

I've also been trying to find a data.table solution (as my real data is very big), but I am not experienced with it.

DN1
  • Are you sure your expected lengths are correct? For Gene1 `max(End) = 4` and `min(Start) = 1`. Also note you need to wrap the calculations in `mutate()` and not use `df$` every time. Just the variable names are needed, i.e. `.... %>% mutate(Length = (min(Start) - max(End)))` – Sotos Jul 24 '20 at 09:12
  • Possible duplicate of https://stackoverflow.com/questions/40570221/calculate-difference-between-dates-by-group-in-r – akrun Jul 24 '20 at 22:31

1 Answer

Perhaps you were trying to do:

library(dplyr)
df %>% group_by(Genes) %>% mutate(Length = max(End) - min(Start))

which in `data.table` is:

library(data.table)
setDT(df)[, Length :=  max(End) - min(Start), Genes]
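
As a quick check, here is a minimal reproducible sketch that rebuilds the sample table from the question and applies the same grouped calculation. The `dt` name and the use of `as.data.table()` (to work on a copy instead of converting `df` in place) are choices made for this illustration, not part of the original code.

```r
library(data.table)

# Sample data copied from the question
df <- data.frame(
  Chr   = c(1, 1, 1, 2),
  Start = c(1, 3, 5, 1),
  End   = c(2, 4, 9, 3),
  Genes = c("Gene1", "Gene1", "Gene2", "Gene3")
)

# Work on a copy so that df itself is not modified by reference
dt <- as.data.table(df)

# For each gene: largest End minus smallest Start,
# repeated on every row belonging to that gene
dt[, Length := max(End) - min(Start), by = Genes]

print(dt)
```

On this sample the `Length` column matches the expected output in the question (3, 3, 4, 2). If the same gene name could occur on more than one chromosome in the real data, grouping with `by = .(Chr, Genes)` may be the safer choice; that is an assumption about the data, so check it against your own.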
Ronak Shah