I am trying to create a box plot with multiple data series and categories, so something like this:
My data is spread over several files, each containing one of the series (e.g. 'high' and 'low'). Each file has several thousand lines, and each line is a tuple containing a string and an int, e.g.
('HHFRVEHAVAEGAK', '3')
('MPHGYDTQVGER', '3')
('MPHGYDTQVGER', '3')
('MPHGYDTQVGER', '3')
('KYNYVAMDTEFPGVVARPIGEFR', '3')
('KYNYVAMDTEFPGVVARPIGEFR', '3')
('KYNYVAMDTEFPGVVARPIGEFR', '3')
('IKEEAVKEKSPSLGK', '3')
('ALLHTVTSILPAEPEAE', '2')
('VAVPTGPTPLDSTPPGGAPHPLTGQEEARAVEK', '5')
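Note that the second element is quoted in the files, so it parses as a string rather than an int. A minimal sketch of parsing one such line, assuming every line is a valid Python tuple literal (`parse_line` is a hypothetical helper name, not part of the code further down):

```python
from ast import literal_eval


def parse_line(line):
    """Parse one "('SEQ', 'z')" line into (sequence, charge as int)."""
    seq, charge = literal_eval(line.strip())
    # The charge is quoted in the file, so convert it explicitly.
    return seq, int(charge)


print(parse_line("('HHFRVEHAVAEGAK', '3')\n"))
```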
I would like to plot the occurrence distribution of the characters in these sequences.
import collections
from ast import literal_eval

import pandas as pd


class MyObj(object):
    __slots__ = ['name', 'seqs', 'charges']

    def __init__(self, name, tuples):
        self.name = name
        self.seqs = set()
        seqs, zs = zip(*tuples)
        self.seqs.update(seqs)  # keep each sequence only once
        #self.charges = collections.Counter(zs)
        self.charges = zs


# Read one file per series into a MyObj keyed by the series name.
data = {}
inf = ['high_corr.txt', 'low_corr.txt']
names = ['high', 'low']
for i, somefile in enumerate(inf):
    with open(somefile, 'r') as f:
        entries = [literal_eval(line.strip()) for line in f]
    index = names[i] if names else f"File{i}"
    data[index] = MyObj(index, entries)


def getCounts(seq):
    # Count how often each character occurs in one sequence.
    c = collections.Counter(seq)
    return {aa: c[aa] for aa in seq}


d = {name: [getCounts(s) for s in pc.seqs] for name, pc in data.items()} # <- tried dict comprehension as well
df = pd.DataFrame.from_dict(d, orient='index')
df = df.transpose()
So when I am done reading the files, I get something like this:
As you can see, I cannot get the individual characters out: each cell holds a whole dict of counts, so nothing gets plotted.
Is there a way I can break the letters out, and have them as a third column, like in the example in the linked question? To reiterate, what I want to achieve is a boxplot with letters on the x-axis, and two boxes drawn (high and low) for each letter.
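The layout I am after could presumably be built by emitting one row per (series, sequence, letter) instead of one dict per sequence. A rough sketch under that assumption: `long_records` and `plot_counts` are hypothetical helper names, the column names `series`, `letter` and `count` are made up for illustration, and using seaborn for the grouped boxes is an assumption (it is not used anywhere in my code above).

```python
import collections


def long_records(seqs_by_series):
    """Flatten {series: iterable of sequences} into one record per
    (series, sequence, letter), holding that letter's count in the sequence."""
    records = []
    for series, seqs in seqs_by_series.items():
        for seq in seqs:
            for letter, count in collections.Counter(seq).items():
                records.append({"series": series, "letter": letter, "count": count})
    return records


def plot_counts(records):
    # Imported lazily so the reshaping step works without the plotting libraries.
    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

    df = pd.DataFrame(records)  # columns: series, letter, count
    # Letters on the x-axis, one box per series for each letter.
    ax = sns.boxplot(data=df, x="letter", y="count", hue="series",
                     order=sorted(df["letter"].unique()))
    plt.show()
    return ax


# Tiny made-up input; with the real files this would be something like
# {name: obj.seqs for name, obj in data.items()}.
example = {"high": {"MPHGYDTQVGER"}, "low": {"IKEEAVKEKSPSLGK"}}
records = long_records(example)
```

Calling `plot_counts(records)` would then draw the grouped boxes. Letters that do not occur in a sequence get no row here, so zero counts are not part of the boxes.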