import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt

a = np.random.random((20, 20))
mask = np.zeros_like(a, dtype=bool)
mask[np.tril_indices_from(mask)] = True  # mask the lower triangle
with sns.axes_style("white"):  # make the plot
    ax = sns.heatmap(a, xticklabels=False, yticklabels=False, mask=mask, square=False, cmap="YlOrRd")
    plt.show()

I'm making a Seaborn heatmap from the upper triangle of a NumPy array.

This code uses pandas:

import pandas as pd

# Read the file with our own column names; the first row holds the
# original headers, so drop it after reading.
df = pd.read_csv('datatraining.txt', sep=',', engine='python', header=None,
                 names=['id', 'date', 'Temperature', 'Humidity', 'Light', 'CO2',
                        'HumidityRatio', 'Occupancy'])
df = df.drop([0])

# Use the date column as a DatetimeIndex and convert the rest to numbers.
df.index = pd.to_datetime(df.date)
df.drop('date', axis=1, inplace=True)
df = df.apply(pd.to_numeric)

# Z-score standardization: zero mean, unit standard deviation.
def scale(df):
    return (df - df.mean()) / df.std()

df.Temperature = scale(df.Temperature)
df.Humidity = scale(df.Humidity)
df.Light = scale(df.Light)
df.CO2 = scale(df.CO2)
df.HumidityRatio = scale(df.HumidityRatio)
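
If the goal is simply to standardize every feature column, the five separate calls can be collapsed; a minimal sketch, reusing the scale helper and column names from above:

# Apply the z-score standardization to all feature columns at once.
cols = ['Temperature', 'Humidity', 'Light', 'CO2', 'HumidityRatio']
df[cols] = df[cols].apply(scale)
# or, equivalently, without the helper:
# df[cols] = (df[cols] - df[cols].mean()) / df[cols].std()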

2 Answers


Plots created using seaborn need to be displayed like ordinary matplotlib plots. This can be done with the plt.show() function from matplotlib.

Originally I posted the solution of using the matplotlib object that seaborn already imports (sns.plt.show()); however, this is considered bad practice. Therefore, simply import the matplotlib.pyplot module directly and show your plots with

import matplotlib.pyplot as plt
plt.show()

If the IPython notebook is used, the inline backend can be invoked to remove the necessity of calling show after each plot. The respective magic is

%matplotlib inline

Give as much detail as you can, and I can help you develop a structure. Details which will affect how you store your data, like:

  • Size of data: number of rows and columns, types of columns; are you appending rows, or just columns?
  • What will typical operations look like? E.g., do a query on columns to select a bunch of rows and specific columns, then do an operation (in-memory), create new columns, save these. (Giving a toy example could enable us to offer more specific recommendations.)
  • After that processing, then what do you do? Is step 2 ad hoc, or repeatable?
  • Input flat files: how many, and rough total size in GB? How are these organized, e.g. by records? Does each one contain different fields, or do they have some records per file with all of the fields in each file?
  • Do you ever select subsets of rows (records) based on criteria (e.g. select the rows with field A > 5) and then do something, or do you just select fields A, B, C with all of the records (and then do something)?
  • Do you 'work on' all of your columns (in groups), or is there a good proportion that you may only use for reports (e.g. you want to keep the data around, but don't need to pull that column in explicitly until final-results time)?
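For the workflow sketched in these questions (append rows, query subsets of rows by a field criterion, pull in only the columns you need), pandas' HDFStore is one common fit. A minimal sketch, not a recommendation for your actual data: the file name and column names here are hypothetical, and the tables (PyTables) package must be installed:

import pandas as pd

store = pd.HDFStore('data.h5')                     # hypothetical store
store.append('df', df, data_columns=['A'])         # 'A' becomes queryable on disk
subset = store.select('df', where='A > 5',         # rows with field A > 5 ...
                      columns=['A', 'B', 'C'])     # ... and only these fields
subset['D'] = subset['A'] * subset['B']            # in-memory work, new column
store.append('results', subset)                    # save the processed rows
store.close()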

  • df.info() - to view info – Markus Bjorn May 23 '22 at 22:34
  • Null values were checked - pd.set_option('display.max_rows',None) df.isnull().sum() – Markus Bjorn May 23 '22 at 22:37
  • pd.set_option('display.max_rows',10) – Markus Bjorn May 23 '22 at 22:37
  • df.head() - next cell - df.info() – Markus Bjorn May 23 '22 at 22:37
  • Distribution of a categorical variable - plt.figure(figsize=(10, 5)) sns.kdeplot(df['tests_units'].value_counts()) plt.title('Distribution tests_units') plt.xlabel('Values') plt.ylabel('Distribution') plt.show() – Markus Bjorn May 23 '22 at 22:39
  • plt.figure(figsize=(10, 5)) sns.kdeplot(df['iso_code'].value_counts()) plt.title('Distribution iso_code') plt.xlabel('Values') plt.ylabel('Distribution') plt.show() – Markus Bjorn May 23 '22 at 22:39
  • plt.figure(figsize=(10, 5)) sns.kdeplot(df['continent'].value_counts()) plt.title('Distribution continent') plt.xlabel('Values') plt.ylabel('Distribution') plt.show() – Markus Bjorn May 23 '22 at 22:39
  • plt.figure(figsize=(10, 5)) sns.kdeplot(df['location'].value_counts()) plt.title('Distribution location') plt.xlabel('Value') plt.ylabel('Distribution') plt.show() – Markus Bjorn May 23 '22 at 22:40
  • The function for calculating the distribution of each numeric attribute - def plot(column): plt.figure(figsize=(10, 5)) sns.kdeplot(df[column]) plt.title('Distribution '+column) plt.xlabel('Values') plt.ylabel('Distribution') plt.show() – Markus Bjorn May 23 '22 at 22:41
  • Using the function - for column in df[:100].select_dtypes(exclude=['object']).columns: plot(column) – Markus Bjorn May 23 '22 at 22:42
  • New Rt attribute (a cleaner rolling-sum equivalent is sketched after this list) - df['Rt']=None data=pd.DataFrame() for country in df['location'].value_counts().keys(): r=df[df['location']==country].copy() da=pd.DataFrame() for i in range(0, len(r), 8): tida=pd.DataFrame() su=r['new_cases'].tail(8).tail(4).sum()/r['new_cases'].tail(8).head(4).sum() tida=r.tail(8) tida['Rt']=su r.drop(r.tail(8).index,inplace=True) da=da.append(tida) data=data.append(da) – Markus Bjorn May 23 '22 at 22:42
  • data=data.fillna(0) – Markus Bjorn May 23 '22 at 22:43
  • data.reset_index(drop=True, inplace=True) df=data – Markus Bjorn May 23 '22 at 22:43
  • d=pd.DataFrame({'Russia': [list(df[df['location']=='Russia']['Rt'])[0]], 'Mexico':[list(df[df['location']=='Mexico']['Rt'])[0]], 'France': [list(df[df['location']=='France']['Rt'])[0]], 'Taiwan':[list(df[df['location']=='Taiwan']['Rt'])[0]], 'United States':[list(df[df['location']=='United States']['Rt'])[0]], 'Japan':[list(df[df['location']=='Japan']['Rt'])[0]], 'Canada':[list(df[df['location']=='Canada']['Rt'])[0]], 'Singapore':[list(df[df['location']=='Singapore']['Rt'])[0]]}).T – Markus Bjorn May 23 '22 at 22:44
  • plt.rcParams.update({'font.size': 15,}) plt.figure(figsize=(15, 8)) plots = sns.barplot(x=d.index, y=d[0], data=df) # Analysis of some countries for bar in plots.patches: plots.annotate(format(bar.get_height(), '.2f'), (bar.get_x() + bar.get_width() / 2, bar.get_height()), ha='center', va='center', size=15, xytext=(0, 8), textcoords='offset points') plt.title('Analysis of the epidemiological situation') plt.ylabel('Rt value') plt.xlabel('Country') plt.show() – Markus Bjorn May 23 '22 at 22:45
  • df.to_csv('result_data.csv', encoding='utf-8-sig', index=False) – Markus Bjorn May 23 '22 at 22:45
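The block-of-8 Rt computation in the comments above can be expressed more directly with grouped rolling sums. A minimal sketch, assuming df is sorted by date within each location; it assigns a per-row ratio rather than one constant per 8-row block, so it is the same idea, not a byte-for-byte reproduction:

import pandas as pd

def rt(cases):
    # cases in the most recent 4 rows vs. the 4 rows before them
    last4 = cases.rolling(4).sum()
    prev4 = cases.shift(4).rolling(4).sum()
    return last4 / prev4

df['Rt'] = df.groupby('location')['new_cases'].transform(rt)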

I come to this question quite regularly, and it always takes me a while to find what I'm searching for:

import seaborn as sns
import matplotlib.pyplot as plt

plt.show()  # <--- This is what you are looking for

Please note: In Python 2, you can also use sns.plt.show(), but not in Python 3.

Complete Example

#!/usr/bin/env python
# -*- coding: utf-8 -*-

"""Visualize C_0.99 for all languages except the 10 with most characters."""

import seaborn as sns
import matplotlib.pyplot as plt

l = [41, 44, 46, 46, 47, 47, 48, 48, 49, 51, 52, 53, 53, 53, 53, 55, 55, 55,
     55, 56, 56, 56, 56, 56, 56, 57, 57, 57, 57, 57, 57, 57, 57, 58, 58, 58,
     58, 59, 59, 59, 59, 59, 59, 59, 59, 60, 60, 60, 60, 60, 60, 60, 60, 61,
     61, 61, 61, 61, 61, 61, 61, 61, 61, 61, 62, 62, 62, 62, 62, 62, 62, 62,
     62, 63, 63, 63, 63, 63, 63, 63, 63, 63, 64, 64, 64, 64, 64, 64, 64, 65,
     65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 66, 66, 66, 66, 66, 66, 66,
     67, 67, 67, 67, 67, 67, 67, 67, 68, 68, 68, 68, 68, 69, 69, 69, 70, 70,
     70, 70, 71, 71, 71, 71, 71, 72, 72, 72, 72, 73, 73, 73, 73, 73, 73, 73,
     74, 74, 74, 74, 74, 75, 75, 75, 76, 77, 77, 78, 78, 79, 79, 79, 79, 80,
     80, 80, 80, 81, 81, 81, 81, 83, 84, 84, 85, 86, 86, 86, 86, 87, 87, 87,
     87, 87, 88, 90, 90, 90, 90, 90, 90, 91, 91, 91, 91, 91, 91, 91, 91, 92,
     92, 93, 93, 93, 94, 95, 95, 96, 98, 98, 99, 100, 102, 104, 105, 107, 108,
     109, 110, 110, 113, 113, 115, 116, 118, 119, 121]

sns.distplot(l, kde=True, rug=False)

plt.show()

Gives this result:

[histogram of the list values with the fitted KDE curve overlaid]
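
Note that distplot has been deprecated since seaborn 0.11 and removed in later releases; on a current seaborn the equivalent plot is

sns.histplot(l, kde=True)
plt.show()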
  • df['day']=df['date'].apply(lambda x: x.day) – Markus Bjorn May 23 '22 at 22:50
  • df.replace([np.inf, -np.inf], np.nan, inplace=True) – Markus Bjorn May 23 '22 at 22:50
  • df=df[df['Rt']<5].reset_index(drop=True) – Markus Bjorn May 23 '22 at 22:50
  • Definition of the hazard variable – Markus Bjorn May 23 '22 at 22:51
  • df1=df[df['Rt']<=0.7].copy() df1['Danger']=0 – Markus Bjorn May 23 '22 at 22:51
  • df2=df[(df['Rt']>0.7) & (df['Rt']<=0.95)].copy() df2['Danger']=1 – Markus Bjorn May 23 '22 at 22:51
  • df3=df[df['Rt']>0.95].copy() df3['Danger']=2 – Markus Bjorn May 23 '22 at 22:52
  • df=pd.concat([df1, df2, df3]).reset_index(drop=True) – Markus Bjorn May 23 '22 at 22:52
  • X=df[['new_cases', 'new_deaths', 'Rt']] y=df['Danger'] – Markus Bjorn May 23 '22 at 22:55
  • data sampling - from sklearn.model_selection import train_test_split X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42, stratify=y) – Markus Bjorn May 23 '22 at 22:55
  • # Pivot-table data for the visualization pt = pd.pivot_table(df, index='continent', columns='Danger', aggfunc='size', fill_value=0) # Normalize the data to a common percentage scale pt = pt.apply(lambda x: round(x / x.sum() * 100,1), axis=1) # Render the visualization plt.figure(figsize=(12, 8), dpi=72) ax = sns.heatmap(pt, annot=True, linewidths=0.1, cmap="copper", fmt='g'); # Axis labels and plot title ax.set_title('Effect of continent on the target variable', fontsize = 15, y=1.05) – Markus Bjorn May 23 '22 at 22:56
  • ax.set_xlabel('Danger level', fontsize = 13) ax.set_ylabel('Region', fontsize = 13) plt.show() – Markus Bjorn May 23 '22 at 22:56
  • # Pivot-table data for the visualization pt = pd.pivot_table(df, index='tests_units', columns='Danger', aggfunc='size', fill_value=0) # Normalize the data to a common percentage scale pt = pt.apply(lambda x: round(x / x.sum() * 100,1), axis=1) # Render the visualization plt.figure(figsize=(12, 8), dpi=72) ax = sns.heatmap(pt, annot=True, linewidths=0.1, cmap="copper", fmt='g'); # Axis labels and plot title ax.set_title('Effect of patient testing on the target variable', fontsize = 15, y=1.05) – Markus Bjorn May 23 '22 at 22:57
  • ax.set_xlabel('Testing category', fontsize = 13) ax.set_ylabel('Danger level', fontsize = 13) plt.show() – Markus Bjorn May 23 '22 at 22:57
  • This graph shows patient testing by Danger level. As you can see above, the highest testing counts fall in danger level 2, while the lowest testing counts appear at the minimum danger level. – Markus Bjorn May 23 '22 at 22:57
  • ax = sns.regplot(x=df['new_cases'], y=df['Danger'], color="g") – Markus Bjorn May 23 '22 at 22:57
  • ax = sns.regplot(x=df['new_deaths'], y=df['Danger'], color="r") – Markus Bjorn May 23 '22 at 22:58
  • ax = sns.regplot(x=df['Rt'], y=df['Danger'], color="b") – Markus Bjorn May 23 '22 at 22:58
  • The three graphs above show a linear dependence of the Danger target variable on the feature variables in X. As you can see, all three graphs have an almost ideal line going from the bottom to the top, which means there is a clear dependence of Danger on these variables in this data. – Markus Bjorn May 23 '22 at 22:58
  • plt.figure(figsize=(17, 6)) sns.scatterplot(data=df, x='date', y='day', alpha=0.005, s=13, hue='Danger'); – Markus Bjorn May 23 '22 at 23:04
  • corrDf=df[['new_cases', 'new_deaths', 'new_tests', 'population', 'Rt', 'Danger']] corrMatrix = corrDf.corr() sns.heatmap(corrMatrix, annot=True, vmin=-1, vmax=1, cmap='coolwarm') plt.show() – Markus Bjorn May 23 '22 at 23:04
  • According to the Pearson correlation above, only Rt, the infection-spread index, has a strong influence on the Danger target variable. – Markus Bjorn May 23 '22 at 23:04
  • Consider three classification models. KNeighborsClassifier: neighbor-based classification is a type of instance-based or non-generalizing learning: it does not attempt to build a general internal model, but simply stores instances of the training data. Classification is computed from a simple majority vote of each point's nearest neighbors: a query point is assigned the class that has the most representatives among that point's nearest neighbors. – Markus Bjorn May 23 '22 at 23:05
  • RandomForestClassifier: a random forest is a meta-estimator that fits a number of decision-tree classifiers on different subsamples of a dataset and uses averaging to improve prediction accuracy and control overfitting. The subsample size is controlled by max_samples if bootstrap=True (the default); otherwise the entire dataset is used to build each tree. – Markus Bjorn May 23 '22 at 23:05
  • GaussianNB: a naive Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem with strict (naive) independence assumptions. Depending on the precise nature of the probabilistic model, naive Bayes classifiers can be trained very efficiently. – Markus Bjorn May 23 '22 at 23:06
  • Consider two metrics for evaluating a classification model. Accuracy f1-score: the harmonic mean of the precision and recall values; it is taken because it gives the best estimate of misclassified cases. Macro avg f1-score: perhaps the simplest of the many averaging methods, the macro-averaged F1 score is the arithmetic mean (the unweighted average) of the per-class F1 scores. It is taken because it treats all classes the same regardless of their support values. – Markus Bjorn May 23 '22 at 23:06
  • Learning (the three fits below are condensed into one loop in the sketch after this list) - # Model imports from sklearn.metrics import classification_report from sklearn.neighbors import KNeighborsClassifier from sklearn.ensemble import RandomForestClassifier from sklearn.naive_bayes import GaussianNB – Markus Bjorn May 23 '22 at 23:09
  • neigh = KNeighborsClassifier(n_neighbors=3) neigh.fit(X_train, y_train) preds=neigh.predict(X_test) print(classification_report(y_test, preds)) – Markus Bjorn May 23 '22 at 23:09
  • rfc = RandomForestClassifier() rfc.fit(X_train, y_train) rfc_preds=rfc.predict(X_test) print(classification_report(y_test, rfc_preds)) – Markus Bjorn May 23 '22 at 23:09
  • gnb = GaussianNB() gnb.fit(X_train, y_train) gnb_preds=gnb.predict(X_test) print(classification_report(y_test, gnb_preds)) – Markus Bjorn May 23 '22 at 23:09
  • The best model is KNeighborsClassifier, with accuracy f1-score = 0.78 and macro avg f1-score = 0.74, since it showed the best result compared to the others. RandomForestClassifier will not be taken because it clearly overfits. – Markus Bjorn May 23 '22 at 23:10
  • Feature Engineering: let's transform the data set by generating new data in order to improve the accuracy of the classifier, using StandardScaler. – Markus Bjorn May 23 '22 at 23:10
  • Data generation - df['day']=df['date'].apply(lambda x: x.day) – Markus Bjorn May 23 '22 at 23:10
  • from sklearn.preprocessing import StandardScaler – Markus Bjorn May 23 '22 at 23:10
  • Transform with StandardScaler - scaler = StandardScaler() X=scaler.fit_transform(df[['new_cases', 'new_deaths', 'Rt', 'day']]) y=df['Danger'] # Create the train/test split X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42, stratify=y) # Training knn = KNeighborsClassifier(n_neighbors=3) knn.fit(X_train, y_train) knn_preds=knn.predict(X_test) print(classification_report(y_test, knn_preds)) – Markus Bjorn May 23 '22 at 23:11
  • Conclusions on Feature Engineering: from the results above, the data transformation for feature engineering did not lead to an improvement in the model. Report: 2.1 Splitting the dataset - the dataset is split into training and test sets. 2.2 Visualizing data dependencies - the data is visualized in multiple ways. 2.3 Classification - three classification algorithms are selected. 2.4 Training - the data is classified according to the level of danger. 2.5 Feature Engineering - the data set was supplemented with additional data and the model retrained. – Markus Bjorn May 23 '22 at 23:11
  • df.to_csv('result_data.csv', encoding='utf-8-sig', index=False) – Markus Bjorn May 23 '22 at 23:11
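
The comment thread above amounts to a small classification pipeline. A condensed sketch of the same steps, assuming a df with the columns used in the comments (the Rt thresholds, features, and split parameters are theirs; note that classification_report expects the true labels first):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

# Danger level from Rt: 0 for Rt <= 0.7, 1 for 0.7 < Rt <= 0.95, 2 for Rt > 0.95
df['Danger'] = pd.cut(df['Rt'], bins=[-float('inf'), 0.7, 0.95, float('inf')],
                      labels=[0, 1, 2]).astype(int)

X = df[['new_cases', 'new_deaths', 'Rt']]
y = df['Danger']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42, stratify=y)

# Fit the three candidate models and print the same report for each.
for model in (KNeighborsClassifier(n_neighbors=3),
              RandomForestClassifier(),
              GaussianNB()):
    model.fit(X_train, y_train)
    print(type(model).__name__)
    print(classification_report(y_test, model.predict(X_test)))

# Scaling belongs inside a pipeline, so it is fit on the training split only:
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
pipe.fit(X_train, y_train)
print(classification_report(y_test, pipe.predict(X_test)))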