
I am trying to categorize genes that have multiple GO descriptors into bins based on what those descriptors relate to. I have dataframe A, which contains the raw data for a list of GeneIDs (>500,000) and their associated GO descriptors, and dataframe B, which classifies these GO descriptors into larger groups.

Example of dataframe A dfA

Example of dataframe B dfB

Ideally, the final output would reference the entire list in dataframe B and add a new column to dataframe A that classifies each GeneID into the GO_Categories associated with its specific GO_IDs. Bonus points if it removes duplicate hits on the GO_Categories.

Looking something like this...

Example of Ideal Solution
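A minimal sketch of how the ideal output could be built, under assumptions the post does not confirm: the toy data below is made up, the GO_IDs in dfA are assumed to be separated by `;`, and the key column is assumed to be called `GeneID`. The idea is to split each gene's GO_IDs into one row per ID, join against dfB, and then collapse the unique categories back to one row per gene.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data; the real tables were only posted as images.
dfA <- tibble(
  GeneID = c("gene1", "gene2", "gene3"),
  GO_IDs = c("GO:0000001;GO:0000002;GO:0000003", "GO:0000003", "GO:0000009")
)
dfB <- tibble(
  GO_IDs      = c("GO:0000001", "GO:0000002", "GO:0000003"),
  GO_Category = c("Metabolism", "Metabolism", "Transport")
)

gene_categories <- dfA %>%
  # One row per GeneID / GO_ID pair (assumes ";" is the delimiter).
  separate_rows(GO_IDs, sep = ";") %>%
  # Exact match on the individual GO_ID; unmatched IDs get NA.
  left_join(dfB, by = "GO_IDs") %>%
  group_by(GeneID) %>%
  summarise(
    # Drop NAs and duplicate categories, then collapse to one string.
    GO_Category = paste(unique(GO_Category[!is.na(GO_Category)]),
                        collapse = "; "),
    .groups = "drop"
  )

# Attach the collapsed categories to the original table.
dfA_out <- left_join(dfA, gene_categories, by = "GeneID")
```

Because the join matches whole GO_IDs rather than substrings, this avoids partial string matching altogether. Genes with no matching GO_ID end up with an empty string in `GO_Category`.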

However, I know that the ideal solution might be difficult to obtain. I already have dataframe B listed out by unique GO_Categories, so a solution like this might be easier to obtain.

Example of Acceptable Solution

So far I have struggled to get any command to search for partial strings using a list from another dataframe and return all matches.

I have had partial success with the acceptable solution approach and using:

dfA <- dfA %>% mutate(GO_Cat_1 = c('No', 'Yes')[1+str_detect(dfA$GO_IDs, as.character(dfB$GO_IDs))])

The solution seems okay; however, it returns an error along the lines of "problem with mutate() column GO_Cat_1. i GO_Cat_1 = ...[]. i longer object length is not a multiple of shorter object length".
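A likely cause of that message, assuming dfA and dfB have different numbers of rows: `str_detect()` is vectorised over both the string and the pattern, so passing the whole `dfB$GO_IDs` vector makes it pair row 1 of dfA with ID 1 of dfB, row 2 with ID 2, and so on, recycling the shorter vector. The sketch below collapses the IDs into a single alternation pattern instead. The toy data and the `;` delimiter are made up for illustration.

```r
library(dplyr)
library(stringr)

# Hypothetical toy data; dfB here holds only the IDs for one category.
dfA <- tibble(
  GeneID = c("gene1", "gene2", "gene3"),
  GO_IDs = c("GO:0000001;GO:0000002", "GO:0000003", "GO:0000009")
)
dfB <- tibble(
  GO_IDs      = c("GO:0000001", "GO:0000003"),
  GO_Category = c("Cat_1", "Cat_1")
)

# Build one regex of the form "GO:0000001|GO:0000003" so that every
# row of dfA is tested against all IDs at once.
pattern <- str_c(dfB$GO_IDs, collapse = "|")

dfA <- dfA %>%
  mutate(GO_Cat_1 = if_else(str_detect(GO_IDs, pattern), "Yes", "No"))
```

With many thousands of IDs a single alternation can become slow, in which case splitting the GO_IDs into rows and joining on exact IDs may scale better.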

I have also looked into grepl/grep, but struggled to feed it a list of terms for partial string matching in dfA.

Any assistance is greatly appreciated!

  • Welcome to StackOverflow, [please see here on how to make a reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). Please avoid posting images of data or code. – jrcalabrese Dec 13 '22 at 23:31
