I'm working on a function that computes similarity over an RDD containing protein names and their domains.
So far I have used the `cartesian` function to generate all possible pairs in my RDD. Each pair looks like this:
((**P29535**,IPR004839;IPR004838;IPR015424;IPR015422;IPR0154),(**A6MML6**,IPR034733;IPR000438;IPR029045;IPR0117))
(PS: this is just one example; the resulting RDD contains millions of pairs.)
The words in bold are the protein names and the rest are their domains, separated by semicolons. Can you please help me determine the degree of similarity between two proteins based on their domains?
I would like to get a result such as:
*protein_name1* + " and " + *protein_name2* + " have a similarity degree equal to: " + *similarity*
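
The question does not say which similarity measure is wanted, so the following is only a minimal sketch under one assumption: that the similarity is the Jaccard index of the two domain sets (shared domains divided by all distinct domains). The helper names `parse_domains`, `jaccard`, and `describe_pair` are hypothetical, and each RDD element is assumed to be a `(name, "dom1;dom2;...")` tuple, as in the example pair above.

```python
def parse_domains(domain_field):
    """Split 'IPR1;IPR2;...' into a set of domain IDs, skipping empty pieces."""
    return {d.strip() for d in domain_field.split(";") if d.strip()}


def jaccard(a, b):
    """Jaccard index of two sets: |a & b| / |a | b| (0.0 if both are empty)."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def describe_pair(pair):
    """Turn ((name1, domains1), (name2, domains2)) into the requested sentence."""
    (name1, domains1), (name2, domains2) = pair
    sim = jaccard(parse_domains(domains1), parse_domains(domains2))
    return f"{name1} and {name2} have a similarity degree equal to: {sim}"


# The example pair from the question (assumed layout, see above).
pair = (
    ("P29535", "IPR004839;IPR004838;IPR015424;IPR015422;IPR0154"),
    ("A6MML6", "IPR034733;IPR000438;IPR029045;IPR0117"),
)
example_line = describe_pair(pair)
print(example_line)
# P29535 and A6MML6 have a similarity degree equal to: 0.0
```

If the pairs live in a PySpark RDD produced by `rdd.cartesian(rdd)`, the same function could be applied with `pairs.map(describe_pair)`. Note that `cartesian` also produces self-pairs and both orderings of every pair, so filtering to pairs where the first name sorts before the second would roughly halve the work.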