Pipeline for UMAP Clustering

Question

# Import pandas library

import pandas as pd

# initialize the data (a dict of lists)

data = {'person_id':[1,1,2,1,2,3,4,5,6,5,6,4,5,4,7,8,8,9,10,1,10,1,10,9,8,7,6,5,4,2], 'condition_concept_id':[43927,5234, 1111, 2222, 5234, 4444000, 67675, 43927, 67890, 5234, 12345,12345, 4444000, 45670, 5234, 67890, 43927,45670, 5234, 12345, 4444000, 1111, 45670, 43927, 67675, 45670, 43927, 67675, 45670,2222] ,'standard_concept_name':['covid', 'diabetes', 'alpha viruses', 'alcohol cirrhosis', 'diabetes', 'alzheimer', 'arthitis', 'covid', 'bladder infection', 'diabetes', 'carbon monoxide poisioning', 'carbon monoxide poisioning', 'alzheimer', 'celiac disease', 'diabetes','bladder infection', 'covid', 'celiac disease', 'diabetes', 'carbon monoxide poisioning', 'alzheimer', 'alpha viruses', 'celiac disease', 'covid','arthitis', 'celiac disease', 'covid', 'arthitis', 'celiac disease','alcohol cirrhosis']}

# Create the pandas DataFrame

df = pd.DataFrame(data)

# print the dataframe

df

I have this dataframe. My end goal is to generate a UMAP plot that shows the disease clustering pattern: people with the same diseases should cluster close together, and people with different diseases should fall far apart.

Using pivot_table, I could collapse the multiple entries for each person_id into a single row.

sampledf=df.pivot_table(index = 'person_id', columns='standard_concept_name',values ='condition_concept_id', aggfunc='count')
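One thing to note about the pivoted table: it holds NaN wherever a person has no record of a disease, and most dimensionality-reduction implementations reject NaN input. A minimal sketch of one way to get a clean person-by-disease matrix is below. The names `counts` and `features` are illustrative, and reducing counts to 0/1 presence flags is an assumption about what matters for the clustering, not a requirement.

```python
import pandas as pd

# Same records as in the question.
data = {
    'person_id': [1, 1, 2, 1, 2, 3, 4, 5, 6, 5, 6, 4, 5, 4, 7, 8, 8, 9, 10,
                  1, 10, 1, 10, 9, 8, 7, 6, 5, 4, 2],
    'condition_concept_id': [43927, 5234, 1111, 2222, 5234, 4444000, 67675,
                             43927, 67890, 5234, 12345, 12345, 4444000, 45670,
                             5234, 67890, 43927, 45670, 5234, 12345, 4444000,
                             1111, 45670, 43927, 67675, 45670, 43927, 67675,
                             45670, 2222],
    'standard_concept_name': [
        'covid', 'diabetes', 'alpha viruses', 'alcohol cirrhosis', 'diabetes',
        'alzheimer', 'arthitis', 'covid', 'bladder infection', 'diabetes',
        'carbon monoxide poisioning', 'carbon monoxide poisioning',
        'alzheimer', 'celiac disease', 'diabetes', 'bladder infection',
        'covid', 'celiac disease', 'diabetes', 'carbon monoxide poisioning',
        'alzheimer', 'alpha viruses', 'celiac disease', 'covid', 'arthitis',
        'celiac disease', 'covid', 'arthitis', 'celiac disease',
        'alcohol cirrhosis'],
}
df = pd.DataFrame(data)

# fill_value=0 replaces the NaN cells with an explicit "no record" count.
counts = df.pivot_table(index='person_id',
                        columns='standard_concept_name',
                        values='condition_concept_id',
                        aggfunc='count',
                        fill_value=0)

# Assumption: only presence/absence matters, so repeated records of the
# same disease for one person collapse to 1.
features = (counts > 0).astype(int)
print(features)
```

Each row of `features` is then one person and each column one disease, which is the shape an embedding method expects (samples by features).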

What would be the next step in the pipeline? Please advise.

I am expecting a UMAP plot that shows the disease clusters, but I don't know how to get to that point.


0 Answers