I would like to limit the y-axis boundaries based on the general range of my data, avoiding spikes but not removing them.

I am producing many sets of graphs comparing two sets of data. Both sets contain a year of data and have been read into dataframes with pandas, and the graphs are produced in a loop, one for each month. One of the sets has intermittent spikes, which cause the y-axis range to be far too large and make the chart unreadable. Setting a fixed boundary with pyplot.ylim() doesn't help, because the general range of the data (for example within one month) changes from chart to chart, and a hard limit reduces the readability of many of the charts.

For example: one month may have data which generally stays below 300,000 but has several spikes which go well over 500,000 (and below -500,000), while another month may also have large spikes but data which otherwise stays below 150,000.

I've also tried setting the values which are too large to NaN using df2 = df[df.y < 500000] = np.nan, based on this answer, but the resulting breaks in the line graph are too small to see, so the fact that spikes occurred gets lost.

Is there some way to figure out what the general maximum and minimum range of the data is so that the y-axis limits can be set in a sensible way?

Andy Grey

1 Answer

As I was writing this question something occurred to me and I solved it by making a copy of the dataframe, removing the very large values, then checking what the max and min values of the remaining data were.

import numpy as np

def check_min_max(selected, selected2):
    # work on a copy so the dataframe that was passed in is left untouched
    max_test = selected2.copy(deep=True)

    # remove very large values
    # (measurements_col and results_col are column names defined elsewhere)
    spikes = (max_test[measurements_col] > 500000) | (max_test[measurements_col] < -500000)
    max_test.loc[spikes, measurements_col] = np.nan

    # get new max and min y-values (NaN is skipped by max() and min())
    measurements_y_max = max_test[measurements_col].max()
    measurements_y_min = max_test[measurements_col].min()

    results_y_max = selected[results_col].max()
    results_y_min = selected[results_col].min()

    # the limits have to cover both data sets
    y_max = max(measurements_y_max, results_y_max)
    y_min = min(measurements_y_min, results_y_min)

    # if all of the data is positive, start the axis just below zero
    if y_min > 0:
        y_min = 0 - (y_max * 0.01)

    # add 5% to range for readability
    return y_min + (y_min * 0.05), y_max + (y_max * 0.05)

I'm also aware that there was no need to copy the dataframe after it was passed to the function. I'd originally written this as inline code before moving it into a function, and I haven't changed it yet.
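A possible variation, which avoids the hard-coded 500,000 cutoff, is to take the limits from quantiles of each month's data, so the "general range" adapts from chart to chart. This is only a sketch under assumptions: robust_ylim, its lower_q, upper_q and pad parameters, the column name y and the synthetic data are all made up for illustration, and the 1%/99% quantiles only ignore spikes when they make up a small fraction of the points in each month.

```python
import numpy as np
import pandas as pd


def robust_ylim(values, lower_q=0.01, upper_q=0.99, pad=0.05):
    """Return (y_min, y_max) covering the bulk of `values`, ignoring rare spikes.

    Hypothetical helper: the quantiles and padding are illustrative defaults.
    """
    series = pd.Series(values, dtype="float64").dropna()
    low = series.quantile(lower_q)
    high = series.quantile(upper_q)
    margin = (high - low) * pad  # padding relative to the visible span
    return low - margin, high + margin


# Synthetic stand-in for one year of hourly measurements (not real data)
rng = np.random.default_rng(0)
index = pd.date_range("2021-01-01", periods=365 * 24, freq="60min")
df = pd.DataFrame({"y": rng.normal(0, 50_000, len(index))}, index=index)
df.iloc[[10, 1500, 5000], 0] = [900_000, -800_000, 750_000]  # inject a few spikes

# One pair of limits per month, computed from that month's data only
monthly_limits = {
    month: robust_ylim(chunk["y"])
    for month, chunk in df.groupby(df.index.month)
}

# In the plotting loop the limits could then be applied with, for example:
#     ax.set_ylim(*monthly_limits[month])
# The spikes stay in the data and are drawn running off the top or bottom
# of the axes, so they remain visible without stretching the y-axis.
```

Because the padding is based on the span between the two quantiles rather than on the limits themselves, it behaves the same whether the data is positive, negative, or straddles zero.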

Andy Grey