I have the following pandas DataFrame:

import pandas as pd
df = pd.read_table(...)
df

>>> df
   interval  location type  y_axis
0        01      1230    X      50
1        01      1609    X      55
2        01      1903    Y      54
3        01      2574    A      58
4        01      3151    A      57
5        01      3198    B      46
6        01      3312    X      50
...                 .....
         02      42      X      31
         02      214     A      23
         02      598     X      28
....

There are several intervals, e.g. 01, 02, etc. Within each interval, data points lie within the range of 1 to 15,000. In df, the first data point is at 1230, the next at 1609, etc.

Interval 02 also has a range from 1 to 15,000.

I would like to create a scatterplot in which the range 1 to 15,000 is plotted proportionally for each interval. The first point would be plotted at 1230, the next at 1609, etc. I would also like vertical lines showing where the intervals begin and end. Each interval is a "region" with its own x-axis from 1 to 15,000, so the coordinates on the x-axis are interval 1: 1 to 15000, interval 2: 1 to 15000, interval 3: 1 to 15000, etc. (It is almost like several individual scatterplots concatenated together.)

How does one accomplish this? Without this complication of intervals, if one wished to create a scatterplot from this DataFrame, one would use:

df.plot(kind='scatter', x = "location", y = "y_axis")

Here are the first 50 rows:

d = {
    "interval": ["01"] * 50,
    "location": [
        1230, 1609, 1903, 2574, 3151, 3198, 3312, 3659, 3709, 3725,
        4172, 4542, 4860, 4900, 5068, 5220, 5260, 5339, 5442, 5529,
        5773, 6128, 6165, 6177, 6269, 6275, 6460, 7167, 7361, 7361,
        8051, 8222, 8305, 8992, 9104, 9439, 9844, 10045, 10764, 10787,
        11104, 11478, 11508, 11684, 12490, 12590, 12794, 12803, 13823, 13982],
    "type": [
        "X", "X", "Y", "A", "A", "B", "X", "X", "X", "B",
        "B", "A", "A", "A", "B", "B", "X", "B", "Y", "X",
        "X", "Y", "Y", "C", "A", "X", "X", "Z", "Z", "B",
        "X", "X", "A", "A", "Y", "X", "A", "X", "X", "Z",
        "Z", "C", "X", "Y", "Y", "Z", "Z", "Z", "Z", "Z"],
    "y_axis": [
        50, 55, 54, 58, 57, 46, 50, 55, 46, 42,
        56, 55, 55, 45, 52, 51, 45, 48, 50, 49,
        53, 55, 45, 40, 49, 37, 52, 58, 52, 4,
        58, 52, 49, 58, 50, 55, 56, 53, 58, 43,
        55, 55, 44, 52, 59, 49, 53, 39, 60, 52],
}
ShanZhengYang
  • It's a bit challenging to understand your question, as there are some inconsistencies. (1) By `dt` do you mean `df`? (2) You say the first datapoint is 40, the second 136. But where are these values in your example data? (3) Your first 50 rows only contain one value for `interval`. Can you provide example data that captures multiple intervals? (4) It would be helpful if you can provide a sketch or link to an example of the plot you want - it's not easy to visualize, based on your description. – andrew_reece May 02 '17 at 05:12
  • I would even consider this question completely unclear and thus unanswerable. – ImportanceOfBeingErnest May 02 '17 at 10:39
  • @andrew_reece Apologies. I corrected some of the values. (1) it is all `df` now (2) see the example data (3) things get large quickly. Imagine the same data for `location`, `type`, `y_axis`, except with `interval` values all `02`. I will provide an example graph – ShanZhengYang May 02 '17 at 15:23
  • @andrew_reece Here's an example of what I mean. It is a scatterplot which has been partitioned into several regions. http://imgur.com/a/l6BvG Each region has the same interval of datapoints, from 1 to 15000. It's therefore tricky to set the x axis, as these points will be mashed together. Am I making sense now? – ShanZhengYang May 02 '17 at 16:12
  • @ImportanceOfBeingErnest Given edits/comments, is it clear now? – ShanZhengYang May 02 '17 at 17:35
  • Thanks, the image you linked to is helpful. Consider including it in the main text of your post. There are still some confusing points that remain. (1) How do you intend to scale the width of each interval's x-axis segment? (2) You change between referring to `xlim` as `(0,10000)` and `(0,15000)`. Which is it? (3) The data points you describe (40, 136) seem unrelated to the example data you're describing. At any rate, see my answer for an approach that I think solves most of your problems. – andrew_reece May 03 '17 at 05:31
  • @andrew_reece Thanks for the help. (1) I know the width of these intervals a priori. It's part of my question---given that I know these are e.g. 15000 long, how do I manually set this for each interval? (2) It's 15000; I was trying to simplify things in the first description (3) this is true; again, I was trying to simplify things (poorly) – ShanZhengYang May 03 '17 at 16:47

3 Answers

3

It seems the main challenge here is that you want the x-axis to be both categorical (intervals 01, 02, etc.) and metric (values 1-15000). You're really talking about plotting several scatter plots with a shared y-axis, as you even pointed out in your post. I'd suggest you do exactly that, using subplots and groupby. You can adjust the space between plots using subplots_adjust(), as I've done in this answer.

First, generate some sample data using d from OP. We'll also randomly select half of the observations and change them to interval=02, to demonstrate the desired paneling:

import pandas as pd
import numpy as np

df = pd.DataFrame(d)

# shuffle rows 
# (taken from this answer: http://stackoverflow.com/a/15772330/2799941)
df = df.reindex(np.random.permutation(df.index))

# randomly select half of the rows for changing to interval 02
interval02 = df.sample(int(df.shape[0]/2.)).index
df.loc[interval02, 'interval'] = "02"

Now specify side-by-side subplots, using pyplot, and remove any padding between the plots.

from matplotlib import pyplot as plt

# n_plots = number of different interval values
n_plots = len(df.interval.unique())

fig, axes = plt.subplots(1, n_plots, figsize=(10,5), sharey=True)

# remove space between plots
fig.subplots_adjust(hspace=0, wspace=0)

Finally, groupby interval and plot:

for i, (name, group) in enumerate(df.groupby('interval')):
    group.plot(kind="scatter", x='location', y='y_axis', 
               ax=axes[i], title="Interval {}".format(name))

[Figure: side-by-side plot]
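The panels above take their x-range from the data, so an interval whose points stop early would look narrower than it really is. A minimal sketch of one way to pin every panel to the same known range follows. It assumes each interval is 15000 long (as stated in the question's comments); `INTERVAL_LENGTH` and the six-row stand-in DataFrame are illustrative, not part of the original answer. Passing `squeeze=False` also keeps the indexing working when only one interval is present.

```python
import matplotlib.pyplot as plt
import pandas as pd

INTERVAL_LENGTH = 15000  # assumption: known width of every interval

# Small stand-in for the real data, using a few rows shown in the question
df = pd.DataFrame({
    "interval": ["01", "01", "01", "02", "02", "02"],
    "location": [1230, 1609, 13982, 42, 214, 598],
    "y_axis": [50, 55, 52, 31, 23, 28],
})

groups = list(df.groupby("interval"))

# squeeze=False always returns a 2-D array of axes, so axes[0] is a row
# of panels even when the data contains a single interval
fig, axes = plt.subplots(1, len(groups), figsize=(10, 5),
                         sharey=True, squeeze=False)
fig.subplots_adjust(wspace=0)

for ax, (name, group) in zip(axes[0], groups):
    group.plot(kind="scatter", x="location", y="y_axis", ax=ax,
               title="Interval {}".format(name))
    # same fixed range in every panel, regardless of where the points fall
    ax.set_xlim(0, INTERVAL_LENGTH)
```

With `wspace=0` the shared panel borders act as the vertical separators between intervals; call `plt.show()` to display the figure.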

andrew_reece
2

It seems you want to plot a different scatter plot for each category "interval".
This can be done by grouping the dataframe by the respective column.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

cat = ["01"] *5 + ["02"]*4
x = np.append(np.arange(1,6), np.arange(2.5,4.1,0.5))
y = np.random.randint(12,24, size=len(cat))
df = pd.DataFrame({"cat":cat, "x":x, "y":y})

fig, ax = plt.subplots()
colors={"01":"crimson", "02":"darkblue"}
for cat, grouped in df.groupby("cat"):
    grouped.plot(kind="scatter", x="x", y="y", ax=ax, label=cat, color=colors[cat])

plt.show()
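If the groups should sit next to each other on a single axis rather than overlap, one possible variation is to shift each interval's x-values by the combined width of the intervals before it and mark the boundaries with `axvline`. This is only a sketch under the assumption that every interval has the same known length; `INTERVAL_LENGTH`, `x_global`, and the small DataFrame are hypothetical names and data introduced for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd

INTERVAL_LENGTH = 15000  # assumption: known width of every interval

# Small stand-in for the real data, using a few rows shown in the question
df = pd.DataFrame({
    "interval": ["01", "01", "01", "02", "02", "02"],
    "location": [1230, 1609, 13982, 42, 214, 598],
    "y_axis": [50, 55, 52, 31, 23, 28],
})

# Number the intervals 0, 1, 2, ... in sorted order
labels = sorted(df["interval"].unique())
position = {label: i for i, label in enumerate(labels)}

# Shift each location by the total width of all preceding intervals
df["x_global"] = df["location"] + df["interval"].map(position) * INTERVAL_LENGTH

fig, ax = plt.subplots(figsize=(10, 5))
ax.scatter(df["x_global"], df["y_axis"])

# Vertical lines at the interior interval boundaries
for i in range(1, len(labels)):
    ax.axvline(i * INTERVAL_LENGTH, color="grey", linestyle="--")

# Label each region at its midpoint instead of showing shifted coordinates
ax.set_xticks([(i + 0.5) * INTERVAL_LENGTH for i in range(len(labels))])
ax.set_xticklabels(labels)
ax.set_xlim(0, len(labels) * INTERVAL_LENGTH)
ax.set_xlabel("interval")
ax.set_ylabel("y_axis")
```

The trade-off is that the tick labels no longer show positions within an interval; the subplot approach keeps those at the cost of one axes object per interval.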


ImportanceOfBeingErnest
  • Now that I see the image, this isn't what I mean. See @andrew_reece's answer. The red/blue points should be separated. – ShanZhengYang May 03 '17 at 16:48
  • You always get what you ask for. If the question is sufficiently unclear like this one here, you may get any kind of answer. You may decide to provide a clear problem description next time and then also get answers fitting more to your needs. – ImportanceOfBeingErnest May 03 '17 at 16:58
  • That's very fair. And I thank you for pointing out how unclear it was----I've edited. I also appreciate your help. No hard feelings or anything.... – ShanZhengYang May 03 '17 at 17:14
1

Using Altair you can easily separate the two intervals as different columns/colors.

import pandas as pd
import numpy as np

# same toy data as in the previous answer: two categories, "01" and "02"
cat = ["01"] * 5 + ["02"] * 4
x = np.append(np.arange(1, 6), np.arange(2.5, 4.1, 0.5))
y = np.random.randint(12, 24, size=len(cat))
df = pd.DataFrame({"cat": cat, "x": x, "y": y})

By columns

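# Written against the Altair 1.x API; later releases drop configure_cell
# in favour of .properties(width=200, height=150)
from altair import *
Chart(df).mark_point().encode(x='x', y='y', column='cat').configure_cell(width=200, height=150)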


By color

from altair import *
Chart(df).mark_point().encode(x='x', y='y', color='cat').configure_cell(width=200, height=150)


Nipun Batra