
Suppose we have a pandas Series of lists where each list contains some characteristics described as strings like this:

0  ["A", "C", "G", ...]
1  ["B", "C", "H", ...]
2  ["A", "X"]
...
N  ["J", "K", ...]

What would be the best/easiest way to plot a 2D pixel grid where the X axis shows the occurrence of each characteristic and the Y axis shows each sample in the series (0, 1, 2, ..., N)?

Edited on Sept 22 16:

It seems I didn't mention explicitly that the list of characteristics isn't necessarily the same size for all observations. Observation 1 can have 4 characteristics, observation 2 can have none, observation 3 can have 5, and so on. So I can't transform them into a NumPy array right away without first preprocessing them so that the missing characteristics are filled in.

srodriguex
  • Have you thought of converting it to a Numpy matrix and then using matplotlib to do what you want? Please include your approach to solve the problem in the question as well. – Kartik Sep 22 '16 at 03:58
  • Sort of taking a shot in the dark here, but are you looking for a [2D "pixel grid" like this](http://i.stack.imgur.com/N5HPE.png)? If not please elaborate on what you want, it doesn't make a lot of sense to me to have a 2D plot that has occurrence as one of the axes. – lanery Sep 22 '16 at 05:20
  • I think the real problem is first to transform the features in the list of lists into a sort of matrix. Please note that the observations in the series aren't necessarily of the same size. It's something like `pandas.get_dummies()`, but that method builds a matrix from the scalar values of a single column, not from values stored in lists in a single column. – srodriguex Sep 23 '16 at 00:17
  • matplotlib [matshow](http://matplotlib.sourceforge.net/examples/pylab_examples/matshow.html) is what I wanted. – srodriguex Sep 23 '16 at 02:10
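
A minimal sketch of the preprocessing step discussed in the comments above, using only pandas. It is not code from the thread: the sample Series `s` is hypothetical, and it assumes pandas 0.25 or later for `Series.explode`. It combines `explode` with `pandas.get_dummies()` so that lists of different lengths, including empty ones, end up as rows of one observation-by-characteristic count matrix.

```python
import pandas as pd

# Hypothetical sample data; observation 2 deliberately has no characteristics
s = pd.Series([["A", "C", "G"], ["B", "C", "H"], [], ["A", "X"]])

# One row per (observation, characteristic) pair; an empty list becomes NaN
long_form = s.explode()

# One indicator column per characteristic; the NaN row gets all zeros
dummies = pd.get_dummies(long_form)

# Collapse back to one row per observation by summing the indicators
matrix = dummies.groupby(level=0).sum().astype(int)

print(matrix)
```

The resulting `matrix` is an ordinary rectangular DataFrame, so it can be passed to `plt.matshow(matrix)` or `sns.heatmap(matrix)` to draw the grid.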

2 Answers


Since I already wrote the code for the image in my comment, and Ed seems to have the same interpretation of your question as I do, I'll go ahead and add my solution.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import string

M, N = 100, 10
letters = list(string.ascii_uppercase)
data = np.random.choice(letters, (M, N))

df = pd.DataFrame(data)
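# Get frequency of letters in each column using Series.value_counts
# (the top-level pd.value_counts is deprecated in recent pandas);
# letters absent from a column are filled with 0 instead of NaN
df_freq = df.apply(pd.Series.value_counts).fillna(0).T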

# Plot frequency dataframe with seaborn heatmap
ax = sns.heatmap(df_freq, linewidths=0.1, annot=False, cbar=True)
plt.show()


lanery
  • It's almost there. As I explained after editing the question, I don't have a perfect MxN matrix beforehand, with M observations and N characteristics. I have a list of M observations where each one has a list of up to N characteristics. – srodriguex Sep 23 '16 at 00:19
  • I figured out how to transform the list of M observations, where each observation is a list of up to N characteristics, using the `CountVectorizer` feature extraction of [sklearn](http://scikit-learn.org/). Once that's done, this question is resolved by this answer. – srodriguex Sep 23 '16 at 01:19
  • Great, glad it worked for you! And keep in mind that you can consider adding your own answer if what you had to add to my answer was substantial. – lanery Sep 23 '16 at 02:50
  • This is a much better answer, I had a feeling there would be a pandas function to do it... – Ed Smith Sep 23 '16 at 08:00

Using pandas for a 1D histogram seems to be straightforward, as in this answer. You could use this idea to fill an N by 26 array and then plot it in 2D with:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import random
import string
from collections import Counter

# Generate a list of random letters and a DataFrame
N = 20    # number of observations (rows)
M = 1000  # letters per observation
letterlist = []
for _ in range(N):
    letterlist.append([random.choice(string.ascii_uppercase) for _ in range(M)])
df = pd.DataFrame(letterlist)

# Fill an array of size N by 26
im = np.zeros([N, 26])
for n in range(N):
    # Get histogram of letters for one row as a Counter
    letter_counts = Counter(df.loc[n])
    # Add the counts to the array
    for k in letter_counts.keys():
        c = ord(k) - ord('A')  # column index 0..25
        im[n, c] = letter_counts[k]

# Plot
plt.imshow(im, interpolation='none')
plt.colorbar()
plt.axis('tight')
plt.xticks(range(26), list(string.ascii_uppercase))
plt.show()
Ed Smith