Add new column to dataframe with label based on if one column value is in between (range) of two other column values

Question

I have a dataframe (about 300 rows) where one column is called "geneID":

geneID   distance  pvalue
4        30        0.05
409      0         0.001
60       41        0.02
...

I have a second dataframe that indicates the range of genes that comprise a larger antibiotic biosynthetic gene cluster (there are about 30 gene clusters in the chromosome):

ClusterID           start   end
Chloramphenicol     100     130
NRPS                403     489
Terpene             5021    5109
...

What I want to do is add another column to dataframe 1 labeled with the corresponding "ClusterID" of dataframe 2 if the geneID is between the "start" and "end" of that gene cluster:

geneID   distance  pvalue  ClusterID
4        30        0.05    NA
409      0         0.001   NRPS
60       41        0.02    NA

I've tried using vectors as values in a mutate function:

ChIP_table %>%
  mutate(ClusterID = case_when((ID >= biosynthetic_clusters$start & ID <= biosynthetic_clusters$end) ~ biosynthetic_clusters$Cluster,
                               TRUE ~ "NA"))

which didn't work. Not sure where to go from here. I've tried building a for loop but still couldn't figure out a way to use a vector/column value as conditions to sort/label.

Any help would be appreciated!

Is [this question](https://stackoverflow.com/questions/57861055/how-can-i-use-mutate-and-case-when-in-a-for-loop) helpful? — ornaldo_, Mar 30 '21 at 19:52

SteveM · Answer 1 · 2021-03-30T21:53:57.353

You could use the cut function. Say your dataframe is df:

breaks <- c(100, 130, 403, 489, 5021, 5109)
labels <- c("Chloramphenicol", NA, "NRPS", NA, "Terpene")

df$ClusterID <- cut(df$geneID, breaks = breaks, labels = labels, include.lowest = TRUE)

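The breaks are the start and end values, in ascending order. The labels are the ClusterID names for each cluster range, with NA labels for the gaps between clusters. A geneID that falls inside a cluster range is assigned that cluster's name; otherwise it gets NA. It takes some up-front work to type in the labels vector (you could write a function to do that), but I think it would work. One caveat: by default cut treats each interval as open on the left and closed on the right, so a geneID exactly equal to a cluster start other than the first (403 here) lands in the preceding gap, and any geneID below 100 or above 5109 gets NA.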
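The helper function mentioned above could look something like the sketch below. It is not a drop-in for cut: it swaps in base R's findInterval, reads the ranges straight from the second dataframe (called df2 here, which is an assumed name, as is label_by_range), treats both start and end as inclusive, and assumes the cluster ranges do not overlap.

```r
# Example data copied from the question.
df <- data.frame(geneID   = c(4, 409, 60),
                 distance = c(30, 0, 41),
                 pvalue   = c(0.05, 0.001, 0.02))
df2 <- data.frame(ClusterID = c("Chloramphenicol", "NRPS", "Terpene"),
                  start     = c(100, 403, 5021),
                  end       = c(130, 489, 5109),
                  stringsAsFactors = FALSE)

# Hypothetical helper: label each value of x with the cluster whose
# [start, end] range contains it, or NA if no range does.
# Assumption: the ranges do not overlap.
label_by_range <- function(x, ranges) {
  ranges <- ranges[order(ranges$start), ]
  # Index of the last cluster whose start is <= x (0 if x is below every start).
  idx <- findInterval(x, ranges$start)
  idx[which(idx == 0L)] <- NA
  # Keep the candidate only if x is also at or below that cluster's end.
  hit <- !is.na(idx) & x <= ranges$end[idx]
  out <- rep(NA_character_, length(x))
  out[hit] <- ranges$ClusterID[idx[hit]]
  out
}

df$ClusterID <- label_by_range(df$geneID, df2)
```

Because the lookup is vectorised, the same call should scale to the full table of genes and clusters without typing any breaks or labels by hand.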

score 1 · Answer 2 · answered Mar 30 '21 at 22:20

We can use case_when from the dplyr package.

library(dplyr)

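# Note: the comparison is elementwise (row i of df1 against row i of df2),
# so this relies on the matching cluster sitting in the same row position,
# as it does in the three-row example data below.
# Bounds are inclusive, matching the "between start and end" requirement.
df1 %>%
  mutate(clusterID = case_when(geneID >= df2$start & geneID <= df2$end ~ df2$ClusterID))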

Output:

  geneID distance pvalue clusterID
   <dbl>    <dbl>  <dbl> <chr>    
1      4       30  0.05  NA       
2    409        0  0.001 NRPS     
3     60       41  0.02  NA
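
For tables of different sizes (about 300 genes and about 30 clusters in the question), a per-gene lookup is one option, since case_when compares the columns position by position. The following is only a rough sketch: the helper name find_cluster is hypothetical, the bounds are assumed to be inclusive, the first matching cluster wins if ranges overlap, and the example data is rebuilt with base R data.frame so the block runs on its own.

```r
# Example data from the question, rebuilt with base R.
df1 <- data.frame(geneID   = c(4, 409, 60),
                  distance = c(30, 0, 41),
                  pvalue   = c(0.05, 0.001, 0.02))
df2 <- data.frame(ClusterID = c("Chloramphenicol", "NRPS", "Terpene"),
                  start     = c(100, 403, 5021),
                  end       = c(130, 489, 5109),
                  stringsAsFactors = FALSE)

# Hypothetical helper: check one geneID against every cluster range
# and return the first matching ClusterID, or NA if there is none.
find_cluster <- function(id, clusters) {
  hit <- which(id >= clusters$start & id <= clusters$end)
  if (length(hit) == 0) NA_character_ else clusters$ClusterID[hit[1]]
}

# Apply the lookup to each gene; df1 and df2 may have different row counts.
df1$clusterID <- vapply(df1$geneID, find_cluster, character(1), clusters = df2)
```

This trades speed for simplicity: it runs one comparison per gene and cluster pair, which should be fine at the sizes described in the question.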

Data:

df1 <- tribble(
~geneID,   ~distance,  ~pvalue,
4, 30, 0.05,
409, 0, 0.001, 
60, 41, 0.02)

df2 <- tribble(
~ClusterID, ~start, ~end,
"Chloramphenicol", 100, 130, 
"NRPS", 403, 489, 
"Terpene", 5021, 5109)
