
I'm trying to customize a phylogenetic tree based on a tree file and a data frame. The tree file and the data frame share the same IDs; for example, GCA_021406745.1_ASM2140674v1 appears in both. The data frame looks like this:

GCA_000375645.1_ASM37564v1  20
GCA_900543265.1_UMGS547 20
GCA_000614355.1_ASM61435v1  7
GCA_000766005.1_ASM76600v1  7

The second column is the cluster value. This value matters because I want to use it to color the tip labels of my phylogenetic tree, for example, "1" = red, "2" = green, and so on. To do that, I'm using Toytree, a Python library for phylogenetic tree manipulation: https://toytree.readthedocs.io/en/latest/index.html

Specifically, I'm using tip_labels_colors to customize the labels. The documentation example (https://toytree.readthedocs.io/en/latest/8-styling.html#Node-labels-styling) does this by building a list of hex color values from the tip labels:

colorlist = ["#d6557c" if "rex" in tip else "#5384a3" for tip in rtre.get_tip_labels()]
rtre.draw(
    tip_labels_align=True,
    tip_labels_colors=colorlist
);

That conditional expression tests whether "rex" is in the label. I want to do the same thing based on my data frame, using the cluster value: build the same kind of colorlist, but with one color per cluster value. I have not been able to do that successfully, so I would appreciate an idea or pseudocode. Here is a minimal example, using data from toytree:

import toytree
import toyplot
import numpy as np

# a tree to use for examples
url = "https://eaton-lab.org/data/Cyathophora.tre"
rtre = toytree.tree(url).root(wildcard='prz')

With the following lines, you can color the tree's tip labels with two different colors.

# make list of hex color values based on tip labels
colorlist = ["#d6557c" if "rex" in tip else "#5384a3" for tip in rtre.get_tip_labels()]
rtre.draw(
    tip_labels_align=True,
    tip_labels_colors=colorlist
);

The example colors each label according to whether it contains "rex". What I need instead is to color my labels according to the cluster values in my data frame.

Mauri1313
  • So you are asking how to use the *second* column of a DataFrame to specify a color for the Node ID specified in the first column? Does `get_tip_labels()` return a list of IDs that can be matched to the first column? Which part are you having trouble with? Your question is a little too broad. Are you asking how to filter the DataFrame by ID? Or are you asking how to map a cluster value to a hex color? Please focus the question on a single specific problem. It will probably help to include a minimal example of `rtre.get_tip_labels()`'s return value. – wwii Jun 23 '22 at 15:13
  • Thanks for your reply. I updated the post to include a minimal example from the toytree example data. What I need is how to map a cluster value to a hex color. Maybe I was not clear enough, sorry about that. Please let me know if you understand my idea – Mauri1313 Jun 23 '22 at 15:22
  • Does [How do I select rows from a DataFrame based on column values?](https://stackoverflow.com/questions/17071871/how-do-i-select-rows-from-a-dataframe-based-on-column-values) answer your question? To map a cluster value to a hex color use a dictionary `{cluster_value:hex_color,...}` – wwii Jun 23 '22 at 15:59
  • Does `rtre.get_tip_labels()` return a sequence of IDs? Please read [mre] - we should not have to go to an offsite resource to retrieve data needed to reproduce your problem. – wwii Jun 23 '22 at 16:01
  • Yes, for example: ```'GCA_900757995.1_ERS537409_6', 'GCA_900538385.1_UMGS18',...``` – Mauri1313 Jun 23 '22 at 16:03

1 Answer

  • make a dictionary mapping cluster values to colors
     colormap = {20:"#d6557c", 7:"#5384a3",...}
  • iterate over the return value of rtre.get_tip_labels(): for ID in rtre.get_tip_labels():
  • for each ID, filter the DataFrame and extract the cluster value as a scalar (the .loc selection on its own returns a Series, which cannot be used as a dictionary key)
    cluster_value = df.loc[df['ID'] == ID,'cluster_value_column_name'].iloc[0]
  • use the cluster value to look up the color
    color = colormap[cluster_value]
  • accumulate the colors in a list.
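The steps above can be sketched as follows. This is only an illustration under stated assumptions: the column names "ID" and "cluster" are made up (the question does not name them), and the hard-coded tip_labels list stands in for the return value of rtre.get_tip_labels() so the snippet runs without a tree file. The .iloc[0] call pulls the single scalar out of the one-row selection.

```python
import pandas as pd

# Stand-in for rtre.get_tip_labels(); with a real tree the labels come from there.
tip_labels = [
    "GCA_000614355.1_ASM61435v1",
    "GCA_000375645.1_ASM37564v1",
    "GCA_000766005.1_ASM76600v1",
    "GCA_900543265.1_UMGS547",
]

# Column names "ID" and "cluster" are assumptions, not taken from the question.
df = pd.DataFrame({
    "ID": [
        "GCA_000375645.1_ASM37564v1",
        "GCA_900543265.1_UMGS547",
        "GCA_000614355.1_ASM61435v1",
        "GCA_000766005.1_ASM76600v1",
    ],
    "cluster": [20, 20, 7, 7],
})

# One hex color per cluster value.
colormap = {20: "#d6557c", 7: "#5384a3"}

colorlist = []
for tip in tip_labels:
    # Select the row for this tip and take its cluster value as a scalar.
    cluster_value = df.loc[df["ID"] == tip, "cluster"].iloc[0]
    colorlist.append(colormap[cluster_value])

# colorlist is now in tip order and could be passed as
# rtre.draw(tip_labels_align=True, tip_labels_colors=colorlist)
```

A tip label that is missing from the data frame would raise an IndexError at .iloc[0], which is a reasonable way to notice mismatched IDs early.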

Alternatively, the colors can be added to the DataFrame using Series.map:

df['colors'] = df['cluster_value_column_name'].map(colormap)

The DataFrame could then be sorted into the same order as rtre.get_tip_labels(), after which df['colors'].to_list() can be passed as tip_labels_colors.
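One possible way to do the reordering is to index the DataFrame by ID and reindex it with the tip labels. This is a sketch, not the only approach: the column names "ID" and "cluster" are again assumed, and tip_labels is a hard-coded stand-in for rtre.get_tip_labels().

```python
import pandas as pd

# Stand-in for rtre.get_tip_labels(); with a real tree the labels come from there.
tip_labels = [
    "GCA_000614355.1_ASM61435v1",
    "GCA_000375645.1_ASM37564v1",
    "GCA_000766005.1_ASM76600v1",
    "GCA_900543265.1_UMGS547",
]

# Column names "ID" and "cluster" are assumptions, not taken from the question.
df = pd.DataFrame({
    "ID": [
        "GCA_000375645.1_ASM37564v1",
        "GCA_900543265.1_UMGS547",
        "GCA_000614355.1_ASM61435v1",
        "GCA_000766005.1_ASM76600v1",
    ],
    "cluster": [20, 20, 7, 7],
})

colormap = {20: "#d6557c", 7: "#5384a3"}

# Translate every cluster value to its color in one vectorized step.
df["colors"] = df["cluster"].map(colormap)

# Index by ID, then reorder the rows to follow the tip order of the tree.
colorlist = df.set_index("ID")["colors"].reindex(tip_labels).to_list()
```

Note that reindex yields NaN for any tip label absent from the data frame, so it is worth checking for missing values before passing the list to rtre.draw.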

Some sorting methods...
sorting by a custom list in pandas
Sort column in Pandas DataFrame by specific order
Sorting a pandas DataFrame by the order of a list

wwii