I have a dataframe like so (the real one has 300+ rows):

        cline    endpt  fx     type  colours 
        SF-268   96.5   1       CNS  #848B9E
22      SF-268  103.3   2       CNS  #848B9E
23      SF-268   60.7   3       CNS  #848B9E
24      SF-268    5.0   4       CNS  #848B9E
25      SF-268    8.7   5       CNS  #848B9E
26      SF-268   -9.4   6       CNS  #848B9E
27      SF-268  -20.7   7       CNS  #848B9E
28      SNB-75  105.5   1       CNS  #848B9E
29      SNB-75   94.5   2       CNS  #848B9E
30      SNB-75   35.3   3       CNS  #848B9E
..         ...    ...  ..       ...      ...
71      SW-620   95.6   2     Colon  #468F14
72      SW-620   73.5   3     Colon  #468F14
73      SW-620    4.0   4     Colon  #468F14
74      SW-620    9.7   5     Colon  #468F14
75      SW-620  -58.6   6     Colon  #468F14
76      SW-620  -49.1   7     Colon  #468F14
77    CCRF-CEM   95.8   1  Leukemia  #FF041E
78    CCRF-CEM   96.6   2  Leukemia  #FF041E
79    CCRF-CEM   89.2   3  Leukemia  #FF041E
80    CCRF-CEM    3.5   4  Leukemia  #FF041E
81    CCRF-CEM   13.7   5  Leukemia  #FF041E
82    CCRF-CEM  -21.3   6  Leukemia  #FF041E
83    CCRF-CEM   -6.6   7  Leukemia  #FF041E
84   HL-60(TB)   93.9   1  Leukemia  #FF041E
85   HL-60(TB)   95.3   2  Leukemia  #FF041E
86   HL-60(TB)   94.0   3  Leukemia  #FF041E
87   HL-60(TB)   13.3   4  Leukemia  #FF041E
88   HL-60(TB)   14.6   5  Leukemia  #FF041E
89   HL-60(TB)  -44.0   6  Leukemia  #FF041E
90   HL-60(TB)  -57.0   7  Leukemia  #FF041E
91       K-562   88.1   1  Leukemia  #FF041E
92       K-562   97.1   2  Leukemia  #FF041E
93       K-562   73.6   3  Leukemia  #FF041E
94       K-562    6.6   4  Leukemia  #FF041E
95       K-562    7.0   5  Leukemia  #FF041E
96       K-562  -21.9   6  Leukemia  #FF041E
97       K-562  -29.6   7  Leukemia  #FF041E
98      MOLT-4   98.9   1  Leukemia  #FF041E
99      MOLT-4   96.8   2  Leukemia  #FF041E
100     MOLT-4   68.9   3  Leukemia  #FF041E

I used the following examples to help me produce my code at the bottom:

I managed to get a plot. However, I think the line plot connects the last y value back to the first, producing a straight line (image below). I'm not sure why. Any help would be appreciated. Thanks.


import csv
import numpy as np
import pandas as pd
import itertools
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
labels = []
for key, grp in dfm.groupby(['colours']):
    ax = grp.plot(ax=ax,linestyle='-',marker='s',x='fx',y='endpt',c=key)
    labels.append(key)
lines, _ = ax.get_legend_handles_labels()
g=[]
for i in labels:
    g.append(list(co.keys())[list(co.values()).index(i)])
ax.legend(lines, g, loc='best')   

enter image description here

Spencer Trinh

2 Answers

The problem is that the values on the x-axis (fx) are not monotonically increasing, so the line jumps back whenever the x values go from 7 back to 1. To avoid this, one can insert nan into the arrays to be plotted at the positions where this jump would occur; matplotlib does not draw a line through nan values. This can be done like

g = lambda x,y: np.insert(y.astype(float), np.arange(len(x)-1)[np.diff(x) < 0]+1, np.nan)

where x is the array of x values and y is the array into which the nans are inserted. Plotting is then done by calling this function on both the x and the y values:

ax.plot(g(x,x), g(x,y),marker='s')
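
To see what the lambda does, here is a small sketch on toy arrays. The name insert_nan_breaks and the sample values are my own, chosen for illustration; the logic is the same as g above, just written out as a function.

```python
import numpy as np

def insert_nan_breaks(x, values):
    """Return `values` as floats, with NaN inserted wherever x decreases."""
    x = np.asarray(x)
    values = np.asarray(values, dtype=float)
    # np.diff(x) < 0 is True at i when x[i+1] < x[i]; +1 gives the index
    # of the first element of the next series, which is where NaN goes.
    break_positions = np.flatnonzero(np.diff(x) < 0) + 1
    return np.insert(values, break_positions, np.nan)

# Two series of three points each, stored back to back (made-up values).
x = np.array([1, 2, 3, 1, 2, 3])
y = np.array([10.0, 20.0, 30.0, 11.0, 21.0, 31.0])

x_broken = insert_nan_breaks(x, x)
y_broken = insert_nan_breaks(x, y)
print(x_broken)
print(y_broken)
```

Both arrays come back one element longer, with nan at index 3, between the two series. x and y must be passed through the same function so they stay the same length.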

A solution using a DataFrame is shown below.

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

x = list(range(1,8))*4  # list() is needed on Python 3, where range is not a list
y = np.array([np.exp(-np.arange(1,8)/3.)*i+i/2. for i in np.arange(1,5)/10.]).flatten()
df = pd.DataFrame({"x":x, "y":y})
print(df)
fig, (ax,ax2) = plt.subplots(ncols=2)

df.plot(x='x',y='y',ax=ax,marker='s')


g = lambda x,y: np.insert(y.astype(float), np.arange(len(x)-1)[np.diff(x) < 0]+1, np.nan)
ax2.plot(g(df.x.values,df.x.values), g(df.x.values,df.y.values),marker='s')
plt.show()


A full example of grouping by colors:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

x = list(range(1,8))*4  # list() is needed on Python 3, where range is not a list
y = np.array([np.exp(-np.arange(1,8)/3.)*i+i/2. for i in np.arange(1,5)/10.]).flatten()
df = pd.DataFrame({"x":x, "y":y, "colours": ["#aa0000"]*len(x)})
x2 = list(range(1,6))*3
y2 = np.array([np.exp(-np.arange(1,6)/2.5)*i+i/2.1 for i in np.arange(1,4)/10.]).flatten()
df2 = pd.DataFrame({"x":x2, "y":y2, "colours": ["#0000aa"]*len(x2)})
df = pd.concat([df, df2])  # DataFrame.append was removed in pandas 2.0


fig, ax = plt.subplots()

g = lambda x,y: np.insert(y.astype(float), np.arange(len(x)-1)[np.diff(x) < 0]+1, np.nan)

# group by a single label (not a list) so that key is the plain colour string
for key, grp in df.groupby('colours'):
    ax.plot(g(grp.x.values,grp.x.values), g(grp.x.values,grp.y.values),
            marker='s', color=key, label=key)

ax.legend()
plt.show()


ImportanceOfBeingErnest
  • thanks for your suggestion :) may I ask how would I add the NaN across the rows in the DataFrame instead of just one column. I need to do this so that I can maintain the colourset column to colour-coordinate the line plots. – Spencer Trinh Jul 14 '17 at 02:13
  • I was able to add NaN across every 8th row as you suggested. I referenced this: https://stackoverflow.com/questions/44599589/inserting-new-rows-in-pandas-data-frame-at-specific-indices?noredirect=1&lq=1. I still get the straight line connecting to the first point however. – Spencer Trinh Jul 14 '17 at 04:06
  • The idea presented here is to not manipulate the dataframe, but to create new numpy arrays with the values inserted and then plot them using matplotlib `plot` function. The color for each plot would be the `key` as in the example from the question. While a solution like the one you link to might work as well, how should I know what went wrong there? – ImportanceOfBeingErnest Jul 14 '17 at 07:50
  • I assumed you needed to insert the `nan`'s across the row to have same dimension dataframe, thats why I referenced that other question. Your code looks great, but tbh I dont know how to implement it correctly. this is what I tried: `ax.plot(g(dfm['fx'].values,dfm['fx'].values),g(dfm['fx'].values,dfm['endpt'].values),marker='s')` and I get: TypeError: 'AxesSubplot' object is not iterable. – Spencer Trinh Jul 16 '17 at 16:17
  • So I found out I mis-typed plot.subplots(), so now it plots correctly however I still do not know how to get the colours to appear properly. I get this error: ValueError: Length mismatch: Expected axis has 35 elements, new values have 279 elements. Isn't this because of the dataframe dimension is different with the inserted `nan`'s? – Spencer Trinh Jul 16 '17 at 16:51
  • I'm not sure how to use the loop and your g function together. I made another lambda fx exactly like yours without float type to insert `nan`'s in the other columns of df and tried the loop using groupby['colours'], but the straight lines appear once again. I tried this to address the issue with dimension mismatch. – Spencer Trinh Jul 16 '17 at 17:01
  • I added an example for grouping by colors. Essentially the only difference is the for-loop and the setting of the color as `color=key`. – ImportanceOfBeingErnest Jul 16 '17 at 21:49
  • Thanks, I noticed a lot of typos in my code after comparing it to yours. – Spencer Trinh Jul 17 '17 at 04:18
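
For anyone following the comments, this is roughly how the loop might look with the question's column names (fx, endpt, colours, type). It is only a sketch: the dfm below is a small made-up frame, not the asker's data, and with_nan_breaks is a hypothetical named version of the g lambda. The DataFrame itself is left untouched, and only the arrays handed to plot get the nan values.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Toy frame with the question's column names; the endpt values are made up.
dfm = pd.DataFrame({
    "cline": ["SF-268"] * 3 + ["SNB-75"] * 3 + ["SW-620"] * 3,
    "endpt": [90.0, 60.0, 5.0, 95.0, 40.0, -10.0, 85.0, 70.0, 0.0],
    "fx": [1, 2, 3] * 3,
    "type": ["CNS"] * 6 + ["Colon"] * 3,
    "colours": ["#848B9E"] * 6 + ["#468F14"] * 3,
})

def with_nan_breaks(x, values):
    """Same idea as the g lambda: NaN goes in wherever x decreases."""
    x = np.asarray(x)
    breaks = np.flatnonzero(np.diff(x) < 0) + 1
    return np.insert(np.asarray(values, dtype=float), breaks, np.nan)

fig, ax = plt.subplots()
# One plot call per colour; the NaN breaks separate the cell lines within it.
for colour, grp in dfm.groupby("colours", sort=False):
    fx = grp["fx"].to_numpy()
    endpt = grp["endpt"].to_numpy()
    ax.plot(with_nan_breaks(fx, fx), with_nan_breaks(fx, endpt),
            marker="s", color=colour, label=grp["type"].iloc[0])
ax.legend(loc="best")
```

Taking the legend label from the type column of each group avoids a reverse lookup in a separate colour dictionary. This assumes each colour belongs to exactly one type, as in the question's table.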

Your data seems to be unsorted. It sounds like you want to sort your data by increasing x value after grouping it:

grp.sort_values(by="fx")
devnull
  • hi thanks for your suggestion. each of my line plots require the 7 'fx' values for the x axis and the 7 'endpt' values for the y axis. When I sort it, matplotlib simply connects everything together which is not what I am looking for – Spencer Trinh Jul 14 '17 at 04:08
  • Okay, sorry, I did not fully understand what you wanted then. Can't you group by each of these lines first? Your data looks like you can group by cline and use the color key. Alternatively, you can first find the groups with the diff approach presented in the answer above and then use this to group, if you want to solve it in a more pandas-like way. – devnull Jul 15 '17 at 18:23
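
A rough sketch of both ideas from the last comment, assuming the frame has the question's columns. The dfm here is a made-up toy frame and the variable names (labelled, segment, n_segments) are mine. Option 1 draws one line per cline, so nothing needs to be broken up. Option 2 derives a segment id from the drops in fx, for the case where no per-line column is available.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit for interactive use
import matplotlib.pyplot as plt
import pandas as pd

# Toy frame with the question's column names; the endpt values are made up.
dfm = pd.DataFrame({
    "cline": ["SF-268"] * 3 + ["SNB-75"] * 3 + ["SW-620"] * 3,
    "endpt": [90.0, 60.0, 5.0, 95.0, 40.0, -10.0, 85.0, 70.0, 0.0],
    "fx": [1, 2, 3] * 3,
    "type": ["CNS"] * 6 + ["Colon"] * 3,
    "colours": ["#848B9E"] * 6 + ["#468F14"] * 3,
})

# Option 1: one line per cell line, coloured by that group's colours value.
fig, ax = plt.subplots()
labelled = set()
for cline, grp in dfm.groupby("cline", sort=False):
    tissue = grp["type"].iloc[0]
    ax.plot(grp["fx"], grp["endpt"], marker="s",
            color=grp["colours"].iloc[0],
            # label each tissue type only once so the legend has no repeats
            label=None if tissue in labelled else tissue)
    labelled.add(tissue)
ax.legend(loc="best")

# Option 2: a new segment starts wherever fx drops; the running count of
# drops gives each row a segment id that can be passed to groupby.
segment = (dfm["fx"].diff() < 0).cumsum()
n_segments = dfm.groupby(segment).ngroups
print(segment.tolist(), n_segments)
```

On this toy frame the segment ids line up with the cline groups, so either key can drive the loop.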