0

First of all, here is what I need: a boxplot with a broken x-axis, possibly with more than one break. An example is this figure: [figure: example of a plot with a broken x-axis]

I have two lists, X and Y (X holds floats, Y holds ints), of the same length. First I group the Y values into sublists according to the integer part of the corresponding X value:

number_of_units = int(max(X)) + 1
my_data = []
for i in range(number_of_units):
  my_data.append([])

for i in range(len(X)):
  j = int(X[i] )
  my_data[j].append(Y[i])

In this way my_data is a list of lists, with number_of_units sublists. The k-th sublist contains all the Y values whose associated X value has integer part k. Here is the problem: most of the sublists are empty. X spans many orders of magnitude and a typical value of number_of_units is 10^5, but most of the X values have integer part in [1,10], so most of the sublists in my_data are empty. The direct consequence is that if I do

fig, ax = plt.subplots()
ax.boxplot(my_data, 'options')

I obtain something like the following figure (note the "upper-right" red point):

[figure: boxplot of my_data over the full x range]

This is due to the emptiness of most of the sublists in my_data: most of the plot shows "zero frequency". So what I need is to break the x-axis of the plot wherever the frequency is zero. Note that:

  • The points where the axis has to be broken must be found dynamically, since they change with the data.
  • It is very likely that the axis has to be broken multiple times.

Theoretical idea

  1. Split the list my_data into M lists of lists, where the split is done according to the emptiness of my_data: if my_data[k] is the first empty sublist, then my_data[0],...,my_data[k-1] is the first group; then find the first non-empty sublist with index >k, and the second group begins there. When I find another empty sublist, the second group is complete, and so on.

  2. Do an ax.boxplot() for each of the new lists of lists. This time none of the sublists will be empty.

  3. Plot each ax as a subplot and join all the subplots as suggested here.
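Step 1 could be sketched as follows, without ever creating the empty sublists. This is only a minimal sketch: the helper names group_by_integer_part and consecutive_runs and the toy data are assumptions made for illustration, not part of the original code.

```python
from collections import defaultdict
from itertools import groupby


def group_by_integer_part(xs, ys):
    """Map int(x) -> list of the y values paired with that x (no empty entries)."""
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[int(x)].append(y)
    return dict(groups)


def consecutive_runs(keys):
    """Split integers into sorted runs of consecutive values."""
    runs = []
    # Within a run of consecutive integers, value - index is constant.
    indexed = enumerate(sorted(keys))
    for _, run in groupby(indexed, key=lambda pair: pair[1] - pair[0]):
        runs.append([value for _, value in run])
    return runs


# Hypothetical toy data, not taken from the question.
X = [0.2, 0.7, 1.5, 2.1, 7.3, 8.9, 8.1, 40.0]
Y = [5, 3, 8, 1, 9, 4, 6, 2]

groups = group_by_integer_part(X, Y)   # only integer parts that occur
runs = consecutive_runs(groups)        # each run would become one subplot
print(runs)
```

The number of runs is then the number of subplots, so it does not have to be known in advance.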

This approach poses a number of difficulties for me. The main one is that I don't know a priori how many subplots I will need, since that number depends on the dataset, and I really don't know how to overcome this. So I ask:

How can I automatically locate the regions of the X-axis that have non-zero frequency and plot only those regions, with a broken axis wherever a region ends?

Any suggestion would be appreciated.

EDIT

My question is not a duplicate of this question because the latter does not contain any explanation of how to break the X axis. However, the combination of the information in questions 1 and 2 might fully solve the problem. I'm working on it and I will edit the question further once the problem is solved.

GRquanti
  • Have you tried to implement any of your proposed solutions (theories)? Which one did you like best and if it was not satisfactory, how was it deficient? – wwii Apr 15 '18 at 14:31
  • The theory is meant as a single solution: delete the zero-frequency lists, do the boxplots of the non-zero-frequency ones only, and finally merge the various plots. I am not able to implement it because the number of subplots is not known a priori. – GRquanti Apr 15 '18 at 14:34
  • Are you asking how to go about making an unknown/arbitrary number of subplots? – wwii Apr 15 '18 at 14:38
  • I'm asking how to automatically plot only the regions of non-zero frequency. If you know how to do an unknown number of subplots, maybe I can get past my problem with it. But it is not necessary, it is one possible solution. – GRquanti Apr 15 '18 at 14:42
  • `non zero frequency` means `sub-list in my_data that contain data`? Are you asking two questions? How to filter out empty lists from `my_data` *and* how to make an arbitrary number of subplots from the result? – wwii Apr 15 '18 at 14:48
  • Yes, this is the meaning of non-zero frequency. I am not asking two questions. I don't know how to filter out empty lists, but I can get past this (I hope). If you know how to do an arbitrary number of subplots, maybe I can solve the problem. But if you know how to plot only the regions of non-zero frequency without filtering out lists and making many subplots, that is fine too. – GRquanti Apr 15 '18 at 14:57
  • Possible duplicate of [Dynamically add/create subplots in matplotlib](https://stackoverflow.com/questions/12319796/dynamically-add-create-subplots-in-matplotlib). It was hard to choose which was best suited for a dupe - search for `python matplotlib make an arbitrary number of subplots` maybe there is a better fit, but they all rely on looping through datasets and using them as arguments to `.subplot()`. Seems you might also want to incorporate a shared y-axis feature. – wwii Apr 15 '18 at 14:58
  • I don't think it's a duplicate because multiple subplots is not the only way to solve the problem. The question you posted may help me, but I don't need shared Y-axis for many subplots, I need a broken X-axis as in the link I posted at the point 3. – GRquanti Apr 15 '18 at 15:06
  • [Python: How to remove empty lists from a list?](https://stackoverflow.com/questions/4842956/python-how-to-remove-empty-lists-from-a-list) - you would be better off not putting them in there in the first place. – wwii Apr 15 '18 at 15:06
  • Sorry, I don't understand: what do you mean by "not putting them in there"? Putting what, and where? – GRquanti Apr 15 '18 at 15:08
  • `my_data[j].append(Y[i])` -> use a conditional expression and only append them if they contain something. – wwii Apr 15 '18 at 15:48
  • Y[i] always contains something. The problem is the value of j: it can range from 0 to number_of_units (e.g. 0, 1,...,10^5), but many of these values don't occur. – GRquanti Apr 15 '18 at 15:56

2 Answers

1

Consider a data sample built like this:

import numpy as np
import matplotlib.pyplot as plt
from itertools import groupby
from operator import itemgetter
import scipy.stats as stats

def truncated_power_law(a, m):
    # discrete distribution on 1..m with P(x) proportional to 1/x**a
    x = np.arange(1, m+1, dtype='float')
    pmf = 1/x**a
    pmf /= pmf.sum()
    return stats.rv_discrete(values=(range(1, m+1), pmf))

a, m = 2, 100000
d = truncated_power_law(a=a, m=m)
N = 10**2

# sorted distinct sample values: these are the boxplot positions
X = np.sort(np.asarray(list(set(d.rvs(size=N)))))
# one array of 100 random values per position
Y = []
for i in range(0,len(X)):
    Y.append(i*np.random.rand(100))

Don't worry about the details of the data, except that X is power-law distributed. This implies that many values between min(X) and max(X) do not appear in the sample.

Now, if you simply do
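How sparse the positions are can be checked with a few lines. This is a minimal sketch: the helper missing_fraction and the sample list are hypothetical and only illustrate the idea.

```python
def missing_fraction(positions):
    """Fraction of the integers in [min, max] that do not occur in positions."""
    present = set(int(p) for p in positions)
    span = max(present) - min(present) + 1
    return (span - len(present)) / span


# Hypothetical positions, not the X generated in this answer.
sample = [1, 2, 3, 5, 8, 40, 1000]
print(missing_fraction(sample))
```

A value close to 1 means that almost all of the x-axis would be empty in a plain boxplot.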

m_props = {'color': 'red',}
b_props = {'color': 'black', 'linestyle': '-'}
w_props = {'color': 'black', 'linestyle': '-'}
c_props = {'color': 'black', 'linestyle': '-'}

f_ugly, ax_ugly = plt.subplots()
ax_ugly.boxplot(Y, notch=0, sym='', positions=X, medianprops=m_props,
                boxprops=b_props, whiskerprops=w_props, capprops=c_props)

You obtain something like this: [figure: bad_box]

Now consider this:

# X is divided into sublists of consecutive values
# (index minus value is constant within a run of consecutive integers)
dominiums = []
for k, g in groupby(enumerate(X), lambda pair: pair[0] - pair[1]):
    dominiums.append(list(map(itemgetter(1), g)))

number_of_subplots = len(dominiums)

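k = 0      # index of the next dataset of Y to plot
d = .01    # half-height of the slanted break marks (axes coordinates)
l = .015   # half-width of the slanted break marks (axes coordinates)

# one panel per group of consecutive values, as wide as the group is long
f, axes = plt.subplots(nrows=1, ncols=number_of_subplots, sharex=False,
                       sharey=True,
                       gridspec_kw={'width_ratios':
                                    [3*len(dominiums[h]) for h in
                                     range(number_of_subplots)],
                                    'wspace': 0.05})

axes[0].yaxis.tick_left()
axes[0].spines['right'].set_visible(False)

# break marks on the right edge of the first panel
# and on the left edge of the last panel
kwargs = dict(transform=axes[0].transAxes, color='k', linewidth=1,
              clip_on=False)
axes[0].plot((1-d/1.5, 1+d/1.5), (-d, +d), **kwargs)
axes[0].plot((1-d/1.5, 1+d/1.5), (1-d, 1+d), **kwargs)
kwargs.update(transform=axes[-1].transAxes)
axes[-1].plot((-l, +l), (1-d, 1+d), **kwargs)
axes[-1].plot((-l, +l), (-d, +d), **kwargs)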

for i in range(number_of_subplots):
    # take the next len(dominiums[i]) datasets of Y, in order
    n = len(dominiums[i])
    data_in_this_subplot = Y[k:k + n]
    k = k + n

    axes[i].boxplot(data_in_this_subplot, notch=0, sym='',
                    positions=dominiums[i], medianprops=m_props,
                    boxprops=b_props, whiskerprops=w_props, capprops=c_props)

    if i != 0:
        # inner panels: hide the left spine and the y ticks
        axes[i].spines['left'].set_visible(False)
        axes[i].tick_params(axis='y', which='both', labelright=False,
                            length=0)
    if i != number_of_subplots - 1:
        # every panel but the last: hide the right spine, draw break marks
        axes[i].spines['right'].set_visible(False)
        kwargs = dict(transform=axes[i].transAxes, color='k',
                      linewidth=1, clip_on=False)
        axes[i].plot((1-l, 1+l), (-d, +d), **kwargs)
        axes[i].plot((1-l, 1+l), (1-d, 1+d), **kwargs)
        axes[i].plot((-l, +l), (1-d, 1+d), **kwargs)
        axes[i].plot((-l, +l), (-d, +d), **kwargs)

Using the same data as in the first figure, the latter code produces the following: [figure: good box]

IMHO this code fully answers the question: it automatically locates the relevant regions of the X axis and plots only those regions, with a broken axis wherever a region ends.

Weakness of the solution: it has a number of arbitrary parameters that must be tuned for every different data set (e.g. d, l, and the factor 3 in 3*len(dominiums[h])).

Strength of the solution: you don't need to know a priori the number of relevant regions (i.e. the number of subplots).

Thanks to wwii for his useful answer and comments.
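Part of the tuning of l might be avoided by scaling it per panel. Since the marks are drawn in axes coordinates, the same l covers a longer distance on a wider panel. The sketch below assumes that panel widths are proportional to the width_ratios passed to plt.subplots and that the marks should span the same distance in every panel; the helper name mark_half_widths is hypothetical.

```python
def mark_half_widths(width_ratios, base=0.015):
    """Half-width of the break mark for each panel, in that panel's axes coordinates.

    `base` is the half-width used on the narrowest panel. Wider panels get a
    proportionally smaller value, so every mark spans the same distance
    in the figure.
    """
    narrowest = min(width_ratios)
    return [base * narrowest / ratio for ratio in width_ratios]


# Hypothetical width ratios for three panels.
half_widths = mark_half_widths([9, 6, 3], base=0.015)
print(half_widths)
```

Each panel i would then use half_widths[i] in place of the single l when drawing its marks. The parameter d and the factor 3 would still need to be chosen by hand.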

GRquanti
0

Without further evidence (your question lacks a minimal example of X and Y), it looks like X and Y values are related to each other by their positions/indices, and you are trying to preserve that relationship by placing Y values in my_data at the index of the related X value. I imagine you are doing that so you don't have to pass the X values to .boxplot(), but that creates a lot of empty space that you don't want in your visualization.

If your data looks similar to this fake data:

X = [1,2,3,9,10,11,50,51,52]
Y = [590, 673, 49, 399, 551, 19, 618, 358, 106, 84,
     537, 865, 507, 862, 905, 335, 195, 250, 54, 497,
     224, 612, 4, 16, 423, 52, 222, 421, 562, 140, 324,
     599, 295, 836, 887, 222, 790, 860, 917, 100, 348,
     141, 221, 575, 48, 411, 0, 245, 635, 631, 349, 646]

The relationship between X, Y, and my_data can be seen by adding a print statement to the for loop that constructs my_data:

....
    my_data[j].append(Y[i])
    print(f'X[{i}]:{X[i]:<6}Y[{i}]:{Y[i]:<6}my_data[{j}]:{my_data[j]}')

>>>
X[0]:1     Y[0]:590   my_data[1]:[590]
X[1]:2     Y[1]:673   my_data[2]:[673]
X[2]:3     Y[2]:49    my_data[3]:[49]
X[3]:9     Y[3]:399   my_data[9]:[399]
X[4]:10    Y[4]:551   my_data[10]:[551]
X[5]:11    Y[5]:19    my_data[11]:[19]
X[6]:50    Y[6]:618   my_data[50]:[618]
X[7]:51    Y[7]:358   my_data[51]:[358]
X[8]:52    Y[8]:106   my_data[52]:[106]

>>>

You would probably be better off not creating the empty space in the first place and just passing the x's and y's to .boxplot(), using X as the argument for its positions parameter:

# again fake Y data
y_s = [[thing] for thing in Y[:len(X)]]
plt.boxplot(y_s, positions=X)

This still leaves a lot of empty space in the plot. This can be fixed by segregating X and Y into slices of contiguous X values and then creating subplots of the fragments using a loop (see Dynamically add/create subplots in matplotlib).
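A minimal sketch of that loop is given here, assuming X is a sorted, non-empty list of integers and that there is one list of samples per X value. The helper names contiguous_slices and boxplot_fragments and the made-up samples are assumptions for illustration. The Agg backend is selected only so that the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line for interactive use
import matplotlib.pyplot as plt


def contiguous_slices(xs):
    """Return (start, stop) index pairs for runs of consecutive integers in xs."""
    slices = []
    start = 0
    for i in range(1, len(xs) + 1):
        # close the current run at the end of xs or where the sequence jumps
        if i == len(xs) or xs[i] != xs[i - 1] + 1:
            slices.append((start, i))
            start = i
    return slices


def boxplot_fragments(xs, ys):
    """Draw one boxplot panel per run of consecutive x positions."""
    slices = contiguous_slices(xs)
    # squeeze=False keeps `axes` two-dimensional even when there is one panel
    fig, axes = plt.subplots(
        nrows=1, ncols=len(slices), sharey=True, squeeze=False,
        gridspec_kw={"width_ratios": [stop - start for start, stop in slices]},
    )
    for ax, (start, stop) in zip(axes[0], slices):
        ax.boxplot(ys[start:stop], positions=xs[start:stop])
    return fig, axes[0]


X = [1, 2, 3, 9, 10, 11, 50, 51, 52]
# made-up samples, one list per x position
y_s = [[10 * x, 10 * x + 5, 10 * x + 9] for x in X]
fig, axes = boxplot_fragments(X, y_s)
fig.savefig("fragments.png")
```

The number of panels is taken from the data, so it does not need to be known beforehand. Hiding the inner spines and drawing break marks between the panels is left out of this sketch.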

wwii