# Import pandas library
import pandas as pd
# Initialize the data as a dict of lists
data = {'person_id':[1,1,2,1,2,3,4,5,6,5,6,4,5,4,7,8,8,9,10,1,10,1,10,9,8,7,6,5,4,2], 'condition_concept_id':[43927,5234, 1111, 2222, 5234, 4444000, 67675, 43927, 67890, 5234, 12345,12345, 4444000, 45670, 5234, 67890, 43927,45670, 5234, 12345, 4444000, 1111, 45670, 43927, 67675, 45670, 43927, 67675, 45670,2222] ,'standard_concept_name':['covid', 'diabetes', 'alpha viruses', 'alcohol cirrhosis', 'diabetes', 'alzheimer', 'arthitis', 'covid', 'bladder infection', 'diabetes', 'carbon monoxide poisioning', 'carbon monoxide poisioning', 'alzheimer', 'celiac disease', 'diabetes','bladder infection', 'covid', 'celiac disease', 'diabetes', 'carbon monoxide poisioning', 'alzheimer', 'alpha viruses', 'celiac disease', 'covid','arthitis', 'celiac disease', 'covid', 'arthitis', 'celiac disease','alcohol cirrhosis']}
# Create the pandas DataFrame
df = pd.DataFrame(data)
# Display the DataFrame
df
I have this dataframe. My end goal is to generate a UMAP plot that shows the disease clustering pattern. For example, people with the same diseases should be clustered close together, and people with different diseases should end up far apart.
Using pivot_table, I could collapse the multiple entries for each person_id into a single row, with one column per disease:
sampledf = df.pivot_table(index='person_id', columns='standard_concept_name', values='condition_concept_id', aggfunc='count')
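The pivoted table holds NaN wherever a person does not have a given disease, so it probably needs to become a fully numeric matrix before it can be passed to anything else. Below is a minimal sketch of one way that might look, assuming a 0/1 presence encoding per disease is a reasonable input for UMAP (I am not sure it is). It uses `pd.crosstab` on a small subset of the rows above; the names `toy`, `counts` and `features` are only illustrative.

```python
import pandas as pd

# Small illustrative subset: the first six rows of the data above
toy = pd.DataFrame({
    'person_id': [1, 1, 2, 1, 2, 3],
    'standard_concept_name': ['covid', 'diabetes', 'alpha viruses',
                              'alcohol cirrhosis', 'diabetes', 'alzheimer'],
})

# Count how often each disease occurs per person.
# crosstab fills person/disease combinations that never occur with 0, not NaN.
counts = pd.crosstab(toy['person_id'], toy['standard_concept_name'])

# Assumption: only presence/absence matters, so collapse counts to 0/1
features = (counts > 0).astype(int)

print(features)
```

Each row of `features` is then one person and each column one disease, with no missing values.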
What would be the next step in the pipeline? Please advise.
I am expecting a UMAP plot that shows the disease clusters, but I don't know how to get from the pivoted table to that point.