-1

Violin Plot

Hi everyone, I'm currently trying out some different visualisation methods in Kaggle and I stumbled upon the violinplot in seaborn. Even though my data set only contains positive numbers (0-1900ish), the violinplot still starts at about -100 and overshoots to about 2000. Did I do something wrong, or is this intentional? Thanks in advance!

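import matplotlib.pyplot as plt
import seaborn as sns

# visu_train_data is the DataFrame with the training data (loaded elsewhere)
f, axs = plt.subplots(1, 2, figsize=(15, 6))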

plt.subplot(1, 2, 1)
sns.violinplot(data=visu_train_data, x='Side', y='Num', hue='Transported', split=True)
plt.xlabel('Side')
plt.ylabel('Num')
plt.title('Transported Status by Side and Num')
plt.grid(True)

plt.subplot(1, 2, 2)
sns.violinplot(data=visu_train_data, x='Deck', y='Num', hue='Transported', split=True)
plt.xlabel('Deck')
plt.ylabel('Num')
plt.title('Transported Status by Deck and Num')
plt.grid(True)

plt.show()

This was my code

Data set info

As you can see, my Num variable has a min value of 0 and a max value of 1900ish.

Trenton McKinney
SAugustus

2 Answers

3

This is normal: a violinplot shows a kernel density estimate (KDE), which smooths the data and, by default, extends the curve a little beyond the smallest and largest observed values.

If you want to restrict the plot to the existing range, pass cut=0:

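import numpy as np
import pandas as pd
import seaborn as sns

df = pd.DataFrame({'col': np.random.randint(1, 100, size=100)})

# cut=0 stops the density at the observed minimum and maximum
sns.violinplot(data=df, cut=0)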

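To see roughly where the overshoot comes from, here is a minimal sketch that estimates the range a violin covers. It assumes a Gaussian KDE with Scott's rule bandwidth (sample standard deviation times n**(-1/5)) and a density extended by cut bandwidths on each side, which is my understanding of seaborn's defaults (cut=2). The helper violin_support and the uniform stand-in data are made up for illustration; with hue and split, seaborn fits a separate KDE per subgroup, so the real numbers will differ.

```python
import random
import statistics


def violin_support(values, cut=2.0):
    """Approximate the (low, high) range covered by a violin.

    Assumption: Gaussian KDE with Scott's rule bandwidth, extended
    `cut` bandwidths past the extreme data points.
    """
    n = len(values)
    bandwidth = statistics.stdev(values) * n ** (-1 / 5)
    return min(values) - cut * bandwidth, max(values) + cut * bandwidth


random.seed(0)
# Hypothetical stand-in for the Num column: values between 0 and 1900
num = [random.uniform(0, 1900) for _ in range(8000)]

low, high = violin_support(num)            # default cut=2 extends past the data
low0, high0 = violin_support(num, cut=0)   # cut=0 stays within the data range

print(f"cut=2: {low:.0f} to {high:.0f}")
print(f"cut=0: {low0:.0f} to {high0:.0f}")
```

With cut=2 the lower bound falls below the smallest value and the upper bound above the largest one, even though no data point lies there; with cut=0 both bounds coincide with the data range.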

mozway
1

The first thing to do is to read the documentation on violinplot:

Unlike a box plot, in which all of the plot components correspond to actual datapoints, the violin plot features a kernel density estimation of the underlying distribution.

This can be an effective and attractive way to show multiple distributions of data at once, but keep in mind that the estimation procedure is influenced by the sample size, and violins for relatively small samples might look misleadingly smooth.

9769953