How can color up or highlighted outliers in scatter plot by using customized range for Vmin and Vmax for cmap?

Question

I tried to normalize the data by using Gaussian function 2 times on both positive and negative numbers of each parameter of this dataset. The dataset includes missing data as well. The problem is I want to highlight outliers via scatter graph by using cmap='coolwarm' for parameters A, B and specifically T so that:

outliers outside of that interval can be marked by (x) or (*) with cmap='coolwarm'
on the right side of the graph cbar is suppose to be available.
my aim is to highlight them in an elegant way before applying cleaning data then compare the raw data and processed data before & after graphs in the form of the subplot in one page.

Is it possible to highlight outliers by from sklearn.neighbors import LocalOutlierFactor? or defineing Vmin and Vmax inspiring from this answer or should I flag outliers before highlighting by Boolean masking (for the sake of learning) or define the function to detect them. my used code to color up outliers as follows:

def normalize(value, min_value, max_value, min_norm, max_norm):
    new_value = ((max_norm - min_norm)*((value - min_value)/(max_value - min_value))) + min_norm
    return new_value

def outlier_fix(data, _min, _max):
    for i in range (0, data.size):
        if (data.iat[i] > _max):
            data.iat[i] = _max
        if (data.iat[i] < _min):
            data.iat[i] = _min
    return data

def createpositiveandnegativelist(listtocreate):
    l_negative = []
    l_positive = []
    for value in listtocreate:
        if (value <= 0):
            l_negative.append(value)
        elif (value > 0):
            l_positive.append(value)
    #print(t_negative)
    #print(t_positive)
    return l_negative,l_positive

def calculatemean(listtocalculate):
    return sum(listtocalculate)/len(listtocalculate)

def plotboundedCI(s, mu, sigma, lists):
    plt.figure()
    '''
    print("\nS:\n",s)
    print("\nmuuu:\n",mu)
    print("\nsigma:\n",sigma)
    '''
    count, bins, ignored = plt.hist(s,30,density=True)
    plt.plot(bins, 1/(sigma * np.sqrt(2 * np.pi)) * np.exp(-(bins-mu)**2/(2*sigma**2)),linewidth=2, color= 'r')
    #confidential interval calculation
    ci = scipy.stats.norm.interval(0.68, loc = mu, scale = sigma)
    #confidence interval for left line
    one_x12, one_y12 = [ci[0],ci[0]], [0,3]
    #confidence interval for right line
    two_x12, two_y12 = [ci[1],ci[1]], [0,3]
    '''
    print("\n\n\n",ci[0])
    print("\n\n\n",ci[1])
    '''
    plt.title("Gaussian 68% Confidence Interval", fontsize=12, color='black', loc='left', style='italic')
    plt.plot(one_x12, one_y12, two_x12, two_y12, marker = 'o')
    #plt.show()


    results = []
    for value in lists:
        if(ci[0]< value <ci[1]):
            results.append(value)
        else:
            #print("NOT WANTED: ",value)
            pass

    return results

df_orig = df.copy()
df_orig[df_orig == np.inf] = np.nan
df_orig[df_orig == -np.inf] = np.nan

def miss_contain_cycles(data):
    miss_cycles = []

    for i in range(math.ceil(data.shape[0] // 480)):
        temp = data[i*480:(i+1)*480]
        if np.sum(temp == np.inf) > 0 or np.sum(temp == -np.inf) > 0 or np.sum(np.isnan(temp)) > 0:
            miss_cycles.append(i)

    return miss_cycles

def missing_stats(data):
    inf_stats = np.sum(data == np.inf)
    minus_inf_stats = np.sum(data == -np.inf)
    nan_stats = np.sum(np.isnan(data))

    miss_cycles = miss_contain_cycles(data)

    return inf_stats, minus_inf_stats, nan_stats, miss_cycles


dft = pd.read_csv('me_300_SOF.csv', header=None)
df_plot.columns = ['A', 'B' ,'T','S','C','Cycle']

fig, ax = plt.subplots(nrows=3, ncols=1, figsize=(20,10), squeeze=False)

df_plot.plot.scatter(ax=ax[0, 0] , alpha=0.8 , x='Cycle', y='A', colormap='coolwarm', c='A') ; ax[0, 0].set_title('A Vs Cycle', fontweight='bold', fontsize=14) ; ax[0, 0].set_ylabel('A')
df_plot.plot.scatter(ax=ax[1, 0] , alpha=0.8 , x='Cycle', y='B', colormap='coolwarm', c='B') ; ax[1, 0].set_title('B Vs Cycle', fontweight='bold', fontsize=14) ; ax[1, 0].set_ylabel('B')
df_plot.plot.scatter(ax=ax[2, 0] , alpha=0.8 , x='Cycle', y='T', colormap='coolwarm', c='T') ; ax[2, 0].set_title('C Vs Cycle', fontweight='bold', fontsize=14) ; ax[2, 0].set_ylabel('T') 

plt.suptitle('Exploratory Data Analysis (EDA) ', color='yellow', backgroundcolor='black', fontsize=15, fontweight='bold')
plt.subplots_adjust(top=0.9, bottom=0.07, left=0.06, right=0.96, hspace=0.4, wspace=0.2)
plt.show()

Any help would be greatly appreciated!

Consider revisiting the help center on how to ask questions here if you feel this has not received enough traction. — ImportanceOfBeingErnest, Jul 28 '19 at 13:37
I'd detect them and use the the `c` keyword of `scatter` to put an according sequence. — mikuszefski, Jul 29 '19 at 08:18
@mikuszefski would you plz to check my [full code](https://drive.google.com/file/d/18Gd7IWIge_9IQjsvmy76-F1nkFZWOsO0/view?usp=sharing). I tried to use **confidence interval** to detect outliers that are outside of that and overlay them on the plot of raw data to highlit them by (x) and in the end, I normalised data between [-1,+1]. I tried this on a data which is free of missing data, but the result was not good as you see [here](https://i.imgur.com/j99oKIB.png). Plz take into account I'm applying this process on both negative and positive numbers of 3 parameters A, B, T base on cycle. — Mario, Jul 31 '19 at 02:38
Hi Mario, can you reduce the amount of code to make your example minimal and maybe add a function to create some generic data to work with? That would help a lot. — mikuszefski, Jul 31 '19 at 15:00

How can color up or highlighted outliers in scatter plot by using customized range for Vmin and Vmax for cmap?

0 Answers0

Linked