0

I need to create a scatterplot from a dictionary that maps DNA sequence IDs to molecular weights. Many of the DNA sequences are ambiguous, so they can have many possible molecular weights (and thus there are many values per key). The dictionary looks something like this, but many of the keys actually have far more values (I've removed some for the sake of brevity).

{'seq_7009': [6236.9764, 6279.027699999999,
   6319.051799999999, 6367.049999999999],
 'seq_418': [3716.3642000000004, 3796.4124000000006],
 'seq_9143_unamb': [4631.958999999999],
 'seq_2888': [5219.3359, 5365.4089],
 'seq_1101': [4287.7417, 4422.8254]}

I have another function called get_all_weights that generates this dictionary, so I'm trying to call that function and then graph the results. This is what I have so far, based on another post on this site, but it doesn't work:

import matplotlib.pyplot as plt
import itertools

def graph_weights(file_name):
    with open (file_name) as file:
        d = {} # Initialize a dictionary and then fill it with the results of the get_all_weights function
        d.update(get_all_weights(file_name))  
        for k, v in d.items():
            x = [key for (key,values) in b.items() for _ in range(len(values))]
            y = [val for subl in d.values() for val in subl]
            ax.plot(x, y)
    plt.show()

Does anyone know how I can achieve this? The plot should show the sequence IDs on the x axis and the values on the y axis and it should make it clear that the same value can occur multiple times.

tdy

2 Answers

3

and it should make it clear that the same value can occur multiple times

With default matplotlib plots, this won't be made clear since similar/identical points will overlap directly.

While it's possible to manually add jittering, the simplest way is to use seaborn's swarmplot or stripplot.

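  1. Create a dataframe with DataFrame.from_dict, transposed so that each sequence ID becomes a column (shorter lists are padded with NaN):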

    import pandas as pd
    data = pd.DataFrame.from_dict(d, orient='index').T
    
    #     seq_7009    seq_418  seq_9143_unamb   seq_2888   seq_1101
    # 0  6236.9764  3716.3642        4631.959  5219.3359  4287.7417
    # 1  6279.0277  3796.4124             NaN  5365.4089  4422.8254
    # 2  6319.0518        NaN             NaN        NaN        NaN
    # 3  6367.0500        NaN             NaN        NaN        NaN
    
  2. Then use either swarmplot or stripplot:

    import seaborn as sns
    sns.swarmplot(data=data)
    

    import seaborn as sns
    sns.stripplot(data=data)
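
For comparison, the manual jittering mentioned at the start of this answer could be sketched without seaborn roughly as follows. This is only an illustration under stated assumptions: `jitter_positions` and `scatter_with_jitter` are hypothetical helper names, the jitter width of 0.15 is an arbitrary choice, and `ax` is assumed to be a matplotlib Axes (for example one returned by `plt.subplots()`).

```python
import random

def jitter_positions(d, width=0.15, seed=0):
    """Return (xs, ys, labels) with one randomly nudged x per weight."""
    rng = random.Random(seed)  # fixed seed so the layout is reproducible
    labels = list(d)
    xs, ys = [], []
    for i, key in enumerate(labels):
        for weight in d[key]:
            # Centre each sequence on the integer i and nudge it sideways,
            # so identical weights no longer sit exactly on top of each other
            xs.append(i + rng.uniform(-width, width))
            ys.append(weight)
    return xs, ys, labels

def scatter_with_jitter(ax, d, width=0.15, seed=0):
    """Draw the jittered points on an existing matplotlib Axes (assumed)."""
    xs, ys, labels = jitter_positions(d, width, seed)
    ax.scatter(xs, ys, alpha=0.6)  # partial transparency also reveals overlap
    ax.set_xticks(list(range(len(labels))))
    ax.set_xticklabels(labels)
    ax.set_ylabel("Molecular Weight")
```

With matplotlib imported as plt, something like `fig, ax = plt.subplots()` followed by `scatter_with_jitter(ax, d)` and `plt.show()` should display the result. Because x and y are built in the same loop, they always have the same length, and empty lists simply contribute no points.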
    

tdy
1

You can plot each sequence ID against its respective values with the following code.

import matplotlib.pyplot as plt

d = {'seq_7009': [6236.9764, 6279.027699999999,
   6319.051799999999, 6367.049999999999],
 'seq_418': [3716.3642000000004, 3796.4124000000006],
 'seq_9143_unamb': [4631.958999999999],
 'seq_2888': [5219.3359, 5365.4089],
 'seq_1101': [4287.7417, 4422.8254]}

plt.figure(figsize=(15,5))
xlabels = []
for i, key in enumerate(d):
    if len(d[key])!=0:
        plt.scatter([i+1]*len(d[key]), d[key], c="#396B8B")
    xlabels.append(key)   
plt.xticks(list(range(1, len(xlabels)+1)), xlabels, rotation='horizontal')
plt.grid(axis="y")
plt.title("Molecular Weight by Sequence ID")
plt.ylabel("Molecular Weight")
plt.show()
BoomBoxBoy
  • You don't need to add 1 to the range and enumeration since they are never shown to the user directly – Mad Physicist Jan 10 '22 at 06:16
  • Also, `xlabels = list(d.keys())` – Mad Physicist Jan 10 '22 at 06:17
  • Thank you so much for the help. As you can probably tell, I'm quite new to Python. I tried this code and it works perfectly as you wrote it, but when I edit it so as to define the dictionary as (get_all_weights(file_name)), it throws the error "x and y must be the same size." Not sure why – user17660176 Jan 10 '22 at 09:53
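
One possible cause of the "x and y must be the same size" error in the last comment, offered only as a guess since get_all_weights is not shown: plt.scatter raises it when the number of x positions differs from the number of y values, which can happen if some values in the returned dictionary are nested lists rather than flat lists of numbers. A minimal sketch of a check, where `flatten_weights` is a hypothetical helper and the nested dictionary is made-up example input:

```python
def flatten_weights(values):
    """Return a flat list of floats from a possibly nested list or tuple."""
    flat = []
    for item in values:
        if isinstance(item, (list, tuple)):
            flat.extend(flatten_weights(item))  # recurse into nested containers
        else:
            flat.append(float(item))
    return flat

# Hypothetical input: the first value is a list wrapped in another list,
# so len() reports 1 item while it actually holds 2 weights
d = {'seq_418': [[3716.3642, 3796.4124]],
     'seq_2888': [5219.3359, 5365.4089]}

clean = {key: flatten_weights(vals) for key, vals in d.items()}
for key, vals in clean.items():
    print(key, len(vals), vals)
```

If the printed lengths differ from `len(d[key])` for any key, the values were nested, and plotting `clean` instead of `d` should avoid the size mismatch. If they all match, the cause lies elsewhere.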