
I have a dataset:

a   b   c   d
10-Apr-86   Jimmy   1   Silly.doc
11-Apr-86   Minnie  2   Lala.doc
12-Apr-86   Jimmy   3   Goofy.doc
13-Apr-86   Minnie  4   Hilarious.doc
14-Apr-86   Jimmy   5   Joyous.doc
15-Apr-86   Eliot   6   Crackingup.doc
16-Apr-86   Jimmy   7   Funny.doc
17-Apr-86   Eliot   8   Happy.doc
18-Apr-86   Minnie  9   Mirthful.doc

Using the following code in Python 2.7.12:

import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv('python.csv')
df_wanted = pd.pivot_table(
    df,
    index='a',
    columns='b',
    values='c')

df_wanted.index = pd.to_datetime(df_wanted.index)

fig = plt.figure()
ax1 = fig.add_subplot(111)

# one scatter per name; explicit labels give the legend its entries
ax1.scatter(df_wanted.index, df_wanted['Jimmy'], s=50, c='b', marker="s", label='Jimmy')
ax1.scatter(df_wanted.index, df_wanted['Minnie'], s=50, c='r', marker="o", label='Minnie')
ax1.scatter(df_wanted.index, df_wanted['Eliot'], s=50, c='g', marker="8", label='Eliot')
plt.legend(loc='upper left')

# write the document name next to every point
for k, v in df.set_index('a').iterrows():
    plt.text(pd.to_datetime(k), v['c'], v['d'])
plt.show()

With it, I can create the following visualization in matplotlib:

[image: scatter plot of c against a, with each point labeled by its document name]

The problem: this is only a toy dataset. When I apply this code to my real dataset, which has more than 3000 points, all the data labels blend together into an illegible black block.

I would like to avoid this problem by using the code here to make the data labels appear when they are clicked.
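For reference, the click-to-label idea can be sketched with matplotlib's built-in pick events alone. This is a minimal illustration and not the linked DataCursor code: the data is a made-up stand-in, and the names note, show_label and on_pick are hypothetical.

```python
import matplotlib.pyplot as plt

# Hypothetical stand-in data: positions and one label per point.
x = [1, 2, 3]
y = [1, 2, 3]
labels = ['Silly.doc', 'Lala.doc', 'Goofy.doc']

fig, ax = plt.subplots()
points = ax.scatter(x, y, s=50, picker=True)

# A single annotation that is moved to whichever point was clicked.
note = ax.annotate('', xy=(0, 0), xytext=(10, 10), textcoords='offset points')
note.set_visible(False)

def show_label(i):
    """Move the annotation to point i and display its label."""
    note.xy = (x[i], y[i])
    note.set_text(labels[i])
    note.set_visible(True)

def on_pick(event):
    show_label(event.ind[0])  # first of the points under the cursor
    fig.canvas.draw_idle()

fig.canvas.mpl_connect('pick_event', on_pick)
# plt.show()  # uncomment to open the interactive window
```

Because only one annotation exists at a time, the plot stays readable however many points it holds.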

The issue I'm having is with this part of the linked DataCursor code:

x=[1,2,3,4,5]
y=[6,7,8,9,10]

fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
scat = ax.scatter(x, y)
DataCursor(scat, x, y)
plt.show()

Obviously, I need to replace the "x" and "y" with my pivot table columns, but I don't know how to make scat = ax.scatter(x, y) or DataCursor(scat, x, y) work with my data.

I tried the following:

scat = ax1.scatter(df_wanted.index, df_wanted['Minnie'], s=50, c='b', marker="s")
scat1 = ax1.scatter(df_wanted.index,df_wanted['Jimmy'], s=50, c='r', marker="o")
scat2 = ax1.scatter(df_wanted.index,df_wanted['Eliot'], s=50, c='g', marker="8")

DataCursor(scat,df_wanted.index,df_wanted['Minnie'])
DataCursor(scat1,df_wanted.index,df_wanted['Jimmy'])
DataCursor(scat2,df_wanted.index,df_wanted['Eliot'])
plt.show()

But I get this error: TypeError: Invalid Type Promotion

UPDATE: I used the code from here to get the doc name in the console:

import matplotlib.pyplot as plt  # needed for the plt. calls below
from matplotlib.pyplot import figure, show
import numpy as npy
import pandas as pd

df = (pd.read_csv('python.csv'))
df_wanted = pd.pivot_table(
    df,
    index='a',
    columns='b',
    values='c')

df_wanted.index = pd.to_datetime(df_wanted.index)


if 1: # picking on a scatter plot     (matplotlib.collections.RegularPolyCollection)

    c = 'r'
    c1 = 'b'
    c2 = 'g'
    s = 85 
    y = df_wanted['Minnie']
    z = df_wanted['Jimmy']
    f = df_wanted['Eliot']
    x = df_wanted.index
    def onpick3(event):
        ind = event.ind
        print(npy.take(df['d'], ind))

    fig = figure()
    ax1 = fig.add_subplot(111)
    # all three series share one axes; each scatter call returns its own collection
    ax1.scatter(x, y, s, c, label='Minnie', picker=True)
    ax1.scatter(x, z, s, c1, label='Jimmy', picker=True)
    ax1.scatter(x, f, s, c2, label='Eliot', picker=True)
    ax1.legend(loc='upper left')
    #fig.savefig('pscoll.eps')
    fig.canvas.mpl_connect('pick_event', onpick3)

show()

The problem now is that the document name returned is often wrong. I think the cause is that event.ind is a position within the individual series that was clicked, not a row number in df. I need a way to combine all the series into one, so that ind counts across the whole dataset.
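One way around this, while keeping one scatter per name, is to remember which rows of df each scatter was drawn from and translate event.ind through that list. The sketch below assumes the same columns a to d as the toy dataset and uses inline data instead of python.csv; rows_by_artist, docs_for_pick and on_pick are hypothetical names.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy data standing in for python.csv (same columns a, b, c, d).
df = pd.DataFrame({
    'a': pd.to_datetime(['10-Apr-86', '11-Apr-86', '12-Apr-86', '13-Apr-86'],
                        format='%d-%b-%y'),
    'b': ['Jimmy', 'Minnie', 'Jimmy', 'Minnie'],
    'c': [1, 2, 3, 4],
    'd': ['Silly.doc', 'Lala.doc', 'Goofy.doc', 'Hilarious.doc'],
})

fig, ax = plt.subplots()
rows_by_artist = {}  # scatter collection -> row labels of df it was drawn from

for name, group in df.groupby('b'):
    artist = ax.scatter(group['a'], group['c'], s=50, label=name, picker=True)
    rows_by_artist[artist] = group.index

def docs_for_pick(artist, ind):
    """Translate per-series positions (event.ind) into document names."""
    rows = rows_by_artist[artist][ind]
    return df.loc[rows, 'd'].tolist()

def on_pick(event):
    print(docs_for_pick(event.artist, event.ind))

fig.canvas.mpl_connect('pick_event', on_pick)
ax.legend(loc='upper left')
# plt.show()  # uncomment to open the interactive window
```

Since every scatter carries its own label here, the legend also gets one entry per name.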

oymonk

1 Answer


I found a solution. I realized I wanted to follow this example (Matplotlib scatterplot; colour as a function of a third variable), but needed to first make a single list of x values and a single list of y values, rather than individual lists of x and y values for each series.

import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv('python.csv')  # load dataset

df['a'] = pd.to_datetime(df['a'])  # convert date column to datetimes

x = list(df['a'].values)  # all dates as one list
y = list(df['c'].values)  # all values as one list

var_names = list(df['b'].values)  # name for every row
var_names1 = sorted(set(var_names))  # unique names from column b, in a fixed order

# dictionary that assigns a number to each unique name in column b
d = {name: n for n, name in enumerate(var_names1)}

# replace names with numbers, used as the color of each point
c = [d[z] for z in var_names]

def onpick3(event):
    # event.ind holds positions in x and y, which match the rows of df
    ind = event.ind
    print(df['d'].iloc[ind])  # print the document name of each picked point

fig = plt.figure()
ax1 = fig.add_subplot(111)
col = ax1.scatter(x, y, s=100, c=c, picker=True)
fig.canvas.mpl_connect('pick_event', onpick3)
plt.legend()  # no artist has a label here, so this adds no entries
plt.show()

The only problem I still have: I can't get a legend to appear.
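A possible explanation is that a single scatter call creates one artist without a label, so the legend has nothing to list. One workaround is to hand the legend a proxy handle per name. This is a sketch under assumptions: the data is a stand-in for df, pd.factorize replaces the hand-built dictionary, and the proxy-handle approach is one option among several.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from matplotlib.lines import Line2D

names = pd.Series(['Jimmy', 'Minnie', 'Jimmy', 'Eliot'])  # stand-in for df['b']
x = [1, 2, 3, 4]
y = [1, 2, 3, 4]

# codes: one integer per row; uniques: the name that each integer stands for
codes, uniques = pd.factorize(names, sort=True)

fig, ax = plt.subplots()
col = ax.scatter(x, y, s=100, c=codes, picker=True)

# One marker-only proxy per name, colored the way the scatter colors that code.
colors = col.to_rgba(np.arange(len(uniques)))
handles = [Line2D([], [], linestyle='', marker='o', color=tuple(rgba), label=name)
           for rgba, name in zip(colors, uniques)]
legend = ax.legend(handles=handles, loc='upper left')
# plt.show()  # uncomment to open the interactive window
```

The points still live in one collection, so event.ind keeps lining up with the rows of the dataframe.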

oymonk