
I am trying to make a scatterplot from the items in the dictionary and need to compare them using seaborn.

The values listed for each animal need to be plotted against the number of base pairs [1000, 2000, 3000], so that they can be compared. For example, for the cat in the first dictionary:

   x     y
1000    53
2000    69
3000     0
import seaborn as sns

dict_1={'cat': [53, 69, 0], 'cheetah': [65, 52, 28]}
dict_2={'cat': [40, 39, 10], 'cheetah': [35, 62, 88]}

sns.set_theme()

sns.relplot(
    data=dict_1,
    x="organism", y="CpG sites")

Technical explanation: the first dictionary holds the original sequence and the second dictionary holds a randomized sequence with the same ACGT content. The listed values are the number of times CG repeats, and these need to be compared in the plot. For the cat, CG repeats 53 times in the first 1000 bp of the original sequence and 40 times in the randomized sequence; at 2000 bp it repeats 69 times in the original sequence and 39 times in the randomized one, and so on.

For example, compared with the plot below: in place of 'tip' (x) I want the 'CG value' listed in the dictionaries, and in place of 'total_bill' (y) I want every 1000 base pairs.


Trenton McKinney

1 Answer

  • It will be easiest to combine the dictionaries into a pandas.DataFrame, and then update df with additional details organizing the data.
  • If the values in the dictionaries are of unequal length, as indicated in a comment, use the approach from "Creating dataframe from a dictionary where entries have different lengths".
    • Create a DataFrame for each dict as shown in the linked answer, and then use pd.concat again to combine the DataFrames.
  • Tested in python 3.11.2, pandas 2.0.0, seaborn 0.12.2
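The unequal-length case can be sketched in isolation. This is a minimal illustration with made-up data and a hypothetical name (counts), not the code used in this answer: wrapping each list in a pd.Series lets pandas align on the index and pad the shorter columns with NaN, whereas passing lists of different lengths straight to pd.DataFrame raises a ValueError.

```python
import pandas as pd

# Hypothetical data: the lists have different lengths per organism
counts = {'cat': [67, 17, 0], 'polarbear': [67, 17, 27, 37]}

# Each list becomes a Series; pandas aligns them on the index
# and fills the positions missing from the shorter list with NaN
wide = pd.DataFrame({name: pd.Series(values) for name, values in counts.items()})

print(wide)
```

The padded columns become float because NaN cannot be stored in an integer column.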
import pandas as pd
import seaborn as sns

# update data in dictionaries from a comment
original_sequence = {'cat': [67, 17, 0], 'cheetah': [67, 17, 11], 'chlamydia': [67, 17, 27, 37, 17], 'polarbear': [67, 17, 27, 37, 32, 0]}
randomized_sequence = {'cat': [71, 61, 0], 'cheetah': [58, 56, 26], 'chlamydia': [47, 43, 44, 42, 29], 'polarbear': [52, 44, 54, 43, 42, 1]}

# list of dicts
list_of_dicts = [original_sequence, randomized_sequence]

# combine the dicts into dataframes, assign a new column to distinguish each sequence, reset the index and use it as the base pair amount
df = (pd.concat([pd.concat([pd.DataFrame(v, columns=[k]) for k, v in data.items()], axis=1)
                 .assign(Sequence=i) for i, data in enumerate(list_of_dicts)], ignore_index=False)
      .reset_index()
      .rename({'index': 'CG Amount'}, axis=1))

# Update the CG Amount column to correspond to the actual numbers
df['CG Amount'] = df['CG Amount'].add(1).mul(1000)

# seaborn works with DataFrames in a long form, so melt
df = df.melt(id_vars=['Sequence', 'CG Amount'], var_name='Organism', value_name='Repeats')
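Optionally, the 0 and 1 in the Sequence column (which come from enumerate) can be swapped for descriptive labels so the plot legend is easier to read. A small sketch on a hypothetical frame (long_df) with the same columns as the melted df; the label names 'original' and 'randomized' are my assumption based on the question.

```python
import pandas as pd

# Hypothetical long-form frame with the same columns as df after .melt
long_df = pd.DataFrame({
    'Sequence': [0, 0, 1, 1],
    'CG Amount': [1000, 2000, 1000, 2000],
    'Organism': ['cat', 'cat', 'cat', 'cat'],
    'Repeats': [67.0, 17.0, 71.0, 61.0],
})

# Map the enumerate() positions to readable labels (label names are assumptions)
labels = {0: 'original', 1: 'randomized'}
long_df['Sequence'] = long_df['Sequence'].map(labels)

print(long_df)
```

The same .map call should work on the real df, provided it is applied before plotting.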

scatter

g = sns.relplot(data=df, x='CG Amount', y='Repeats', hue='Sequence', col='Organism', col_wrap=2)


bar

  • If you're comparing two sequences at discrete intervals, a barplot seems the better option.
g = sns.catplot(data=df, kind='bar', x='CG Amount', y='Repeats', hue='Sequence', col='Organism', col_wrap=2)


df before .melt

    CG Amount   cat  cheetah  chlamydia  polarbear  Sequence
0        1000  67.0     67.0       67.0         67         0
1        2000  17.0     17.0       17.0         17         0
2        3000   0.0     11.0       27.0         27         0
3        4000   NaN      NaN       37.0         37         0
4        5000   NaN      NaN       17.0         32         0
5        6000   NaN      NaN        NaN          0         0
6        1000  71.0     58.0       47.0         52         1
7        2000  61.0     56.0       43.0         44         1
8        3000   0.0     26.0       44.0         54         1
9        4000   NaN      NaN       42.0         43         1
10       5000   NaN      NaN       29.0         42         1
11       6000   NaN      NaN        NaN          1         1

df.head() after .melt

   Sequence  CG Amount Organism  Repeats
0         0       1000      cat     67.0
1         0       2000      cat     17.0
2         0       3000      cat      0.0
3         0       4000      cat      NaN
4         0       5000      cat      NaN

df.tail() after .melt

    Sequence  CG Amount   Organism  Repeats
43         1       2000  polarbear     44.0
44         1       3000  polarbear     54.0
45         1       4000  polarbear     43.0
46         1       5000  polarbear     42.0
47         1       6000  polarbear      1.0
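The NaN rows in the melted frame come from the organisms with shorter lists. As far as I know seaborn skips missing values when plotting, so removing them is optional, but they can be dropped explicitly if a tidy frame is preferred. A minimal sketch on a hypothetical frame (long_df) shaped like the melted df:

```python
import pandas as pd

# Hypothetical long-form rows shaped like df after .melt
long_df = pd.DataFrame({
    'Sequence': [0, 0, 0, 0],
    'CG Amount': [1000, 2000, 3000, 4000],
    'Organism': ['cat'] * 4,
    'Repeats': [67.0, 17.0, 0.0, float('nan')],
})

# Drop only the rows with no recorded count; a real count of 0.0 is kept
clean = long_df.dropna(subset=['Repeats']).reset_index(drop=True)

print(clean)
```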

Notes

  • If the values in dictionaries have the same length, use the following code to create df
dict_1 = {'cat': [53, 69, 0], 'cheetah': [65, 52, 28]}
dict_2 = {'cat': [40, 39, 10], 'cheetah': [35, 62, 88]}

list_of_dicts = [dict_1, dict_2]

df = (pd.concat([pd.DataFrame(d, index=range(1000, 4000, 1000)).assign(Sequence=i) for i, d in enumerate(list_of_dicts)],
                ignore_index=False)
      .reset_index()
      .rename({'index': 'CG Amount'}, axis=1))
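For the equal-length case, the frame still needs to be melted into long form before plotting. A self-contained sketch of the whole path with the data from the question; the names wide and long_df are only for illustration.

```python
import pandas as pd

dict_1 = {'cat': [53, 69, 0], 'cheetah': [65, 52, 28]}
dict_2 = {'cat': [40, 39, 10], 'cheetah': [35, 62, 88]}

# Same construction as in the note: one frame per dict, indexed by base pairs
wide = (pd.concat([pd.DataFrame(d, index=range(1000, 4000, 1000)).assign(Sequence=i)
                   for i, d in enumerate([dict_1, dict_2])])
        .reset_index()
        .rename({'index': 'CG Amount'}, axis=1))

# Melt to long form: one row per (sequence, base pair amount, organism)
long_df = wide.melt(id_vars=['Sequence', 'CG Amount'], var_name='Organism', value_name='Repeats')

print(long_df)
```

The resulting long_df can be passed to sns.relplot or sns.catplot with the same x, y, hue and col arguments used above.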
Trenton McKinney