I am trying to create a grouped box plot with multiple data series and categories.

My data is spread over several files, each containing one of the series (e.g. 'high' or 'low'). Each file has several thousand lines, and each line is a tuple of a sequence string and an integer (stored as a string), e.g.

('HHFRVEHAVAEGAK', '3')
('MPHGYDTQVGER', '3')
('MPHGYDTQVGER', '3')
('MPHGYDTQVGER', '3')
('KYNYVAMDTEFPGVVARPIGEFR', '3')
('KYNYVAMDTEFPGVVARPIGEFR', '3')
('KYNYVAMDTEFPGVVARPIGEFR', '3')
('IKEEAVKEKSPSLGK', '3')
('ALLHTVTSILPAEPEAE', '2')
('VAVPTGPTPLDSTPPGGAPHPLTGQEEARAVEK', '5')

I would like to plot the occurrence distribution of the characters in these sequences.

import collections
from ast import literal_eval

import pandas as pd

class MyObj(object):

    __slots__ = ['name', 'seqs', 'charges']

    def __init__(self, name, tuples):
        self.name = name
        self.seqs = set()

        seqs, zs = zip(*tuples)
        self.seqs.update(seqs)
        #self.charges = collections.Counter(zs)
        self.charges = zs

data = {}
inf = ['high_corr.txt', 'low_corr.txt']
names = ['high', 'low']
for i, somefile in enumerate(inf):
    with open(somefile, 'r') as f:
        entries = [literal_eval(line.strip()) for line in f]
        index = names[i] if names else f"File{i}"
        data[index] = MyObj(index, entries)

def getCounts(seq):
    # Count how often each character occurs in one sequence
    c = collections.Counter(seq)
    return {aa: c[aa] for aa in seq}

d = {name: [getCounts(s) for s in pc.seqs] for name, pc in data.items()} # <- tried dict comprehension as well
df = pd.DataFrame.from_dict(d, orient='index')
df = df.transpose()

So when I am done reading the files, I get a DataFrame with one column per series ('high' and 'low'), in which every cell holds the dict of character counts for one sequence.

I cannot get the individual characters out: they are stored as dicts, and thus do not get plotted.

Is there a way I can break the letters out, and have them as a third column, like in the example in the linked question? To reiterate, what I want to achieve is a boxplot with letters on the x-axis, and two boxes drawn (high and low) for each letter.

posdef
  • Have you looked at `MultiIndex.from_tuples`? – liam Sep 25 '17 at 09:10
  • @LiamHealy I have looked at the documentation a bit. It appears relevant, but it's not immediately clear to me how I could get the data out of its current form and into a list of tuples to create the index and then read the values. – posdef Sep 25 '17 at 09:58
  • Please provide a [mcve] of the issue. Otherwise it's off-topic. – ImportanceOfBeingErnest Sep 25 '17 at 10:50
  • @ImportanceOfBeingErnest I get the point with the MCV examples, but I disagree that it is off-topic (also nowhere in the linked FAQ page does it say that questions without a MCVE are off-topic). I feel this question is very much on topic here at SO, although might be difficult to answer, which is a risk I took when asking the question. Thanks for not trying to answer (and likely voting to close). – posdef Sep 25 '17 at 12:12
  • It is off-topic because the content of the text files is unknown and some of the functions in the code are unknown. So basically you are asking how to get a defined output from an unknown input which is not answerable and hence off-topic. – ImportanceOfBeingErnest Sep 25 '17 at 20:39
  • @ImportanceOfBeingErnest I thought I mentioned that the files are pretty simple, containing a string and an int, but I should have made it more explicit. As for the code, there was very little that wasn't there. The issue is that I have a dict of a dict of a dict that needs to be visualised. That's why I didn't provide unrelated details. I hope the edits make the question clearer. – posdef Sep 26 '17 at 08:41

1 Answer

I'm not sure this is the best way, but a list comprehension is one possibility:

import string

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Simulate your data: each cell is a dict mapping a letter to a random count
def random_counts():
    return {k: np.random.randint(1, 27) for k in string.ascii_uppercase}

d = {'high': [random_counts() for _ in range(3)],
     'low': [random_counts() for _ in range(3)]}
df = pd.DataFrame(d)
print(df.head())

# "Unpivot" your data: one (variable, letter, count) row per dict entry
l = [(col, letter, count)
     for col, series in df.items()
     for _, dd in series.to_dict().items()
     for letter, count in dd.items()]
new_df = pd.DataFrame(l)
new_df.columns = ['variable', 'letter', 'count']
print(new_df.head())

# Boxplot with seaborn
sns.boxplot(x='letter', y='count', hue='variable', data=new_df)
plt.show()

For the larger problem you've described, I think it would be better to "unpivot" before building the DataFrame, i.e. use a list comprehension instead of the dict comprehension on the line you've commented. I don't have your data, so I can only guess that it would look something like this:

d = [(name, letter, count)
     for name, pc in data.items()
     for s in pc.seqs
     for letter, count in getCounts(s).items()]  # .items() yields (letter, count) pairs
df = pd.DataFrame(d)
df.columns = ['variable', 'letter', 'count']
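
To make the "unpivot first" idea easier to check without the original files, here is a minimal sketch that builds the same kind of long-format rows using only the standard library. The `raw` dict and the `long_records` helper are hypothetical names introduced for illustration; the sequences are copied from the sample lines in the question, and how they are split between 'high' and 'low' is an assumption.

```python
from collections import Counter

# Hypothetical stand-in for the parsed files: series name -> (sequence, charge) tuples.
# The sequences come from the sample lines in the question; the split is assumed.
raw = {
    "high": [("HHFRVEHAVAEGAK", "3"), ("MPHGYDTQVGER", "3"), ("MPHGYDTQVGER", "3")],
    "low": [("IKEEAVKEKSPSLGK", "3"), ("ALLHTVTSILPAEPEAE", "2")],
}


def long_records(raw):
    """Return (series, letter, count) rows, one per letter per unique sequence."""
    rows = []
    for name, tuples in raw.items():
        # Deduplicate sequences, as the set in MyObj does; sort for a stable order.
        for seq in sorted({s for s, _ in tuples}):
            for letter, count in sorted(Counter(seq).items()):
                rows.append((name, letter, count))
    return rows


records = long_records(raw)
```

The resulting list can be passed to `pd.DataFrame(records, columns=['variable', 'letter', 'count'])` and plotted with the same `sns.boxplot` call as above. Note that a letter missing from a sequence produces no row at all, so zero counts are not part of the plotted distribution in this sketch.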
Y. Luo