# first, some imports:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Let's say I want to make a scatter plot, using this data:

np.random.seed(42)
x=np.arange(0,50)
y=np.random.normal(loc=3000,scale=1,size=50)

Plot via:

plt.scatter(x,y)

I get this answer:

(image: scatter plot of x against the numeric y values)

Ok, let's create a dataframe first:

df=pd.DataFrame.from_dict({'x':x,'y':y.astype(str)})

(I am aware that I am storing y as str - this is a reproducible example, and I do this to reflect the real use case.)

Then, if I do:

plt.scatter(df.x,df.y)

I get:

(image: scatter plot of df.x against df.y, with y stored as str)


What am I seeing in this second plot? I expected it to show the x column plotted against the y column, with the y values converted to float. This is clearly not the case.

Trenton McKinney
zabop

3 Answers


Matplotlib doesn't automatically convert str values to numbers, so your y values are treated as categorical. As far as Matplotlib is concerned, the step from '1.0' to '0.9' is no different from the step from '1.0' to '100.0'.

So the y-axis positions will be the same as range(len(y)) (every distinct categorical value is spaced equally from the next), with tick labels taken from the categorical values.

Since your x is equal to range(50), and your y positions are now also equal to range(50), it plots the line y = x, with the y tick labels set to the respective str values.
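
If the numeric scatter plot is what you want, one option is to convert the str column back to float before plotting. This is a minimal sketch using the question's data; the `y_num` column name is made up for illustration, and `pd.to_numeric(df["y"])` should work just as well as `astype(float)`.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

np.random.seed(42)
x = np.arange(0, 50)
y = np.random.normal(loc=3000, scale=1, size=50)
df = pd.DataFrame({"x": x, "y": y.astype(str)})  # y stored as str, as in the question

# Convert the strings back to numbers; "y_num" is a hypothetical column name.
df["y_num"] = df["y"].astype(float)

fig, ax = plt.subplots()
ax.scatter(df["x"], df["y_num"])  # numeric y, so the axis is continuous again
```

With a numeric column the y-axis is scaled by value again, so the points scatter around 3000 instead of lining up on a diagonal.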

dm2

As per the excellent answer by dm2, when you pass y as a string, y is simply being treated as arbitrary string labels, and being plotted one after the other in the order in which they appear. To demonstrate, here's an even simpler example.

from matplotlib import pyplot as plt
x = [1, 2, 3, 4]
y = [5, 25, 10, 1] # these are ints
plt.scatter(x, y)

plot with y given as integers

So far so good. Now, different string y values.

y = list("abcd")
plt.scatter(x, y)

plot with y as just some labels

You can see how it takes the y labels and drops them on the axis one after another.

Finally,

y = ["5", "25", "10", "1"]
plt.scatter(x, y)

plot with y having strings of integers

Compare this with the previous results and now it should become obvious what's going on.
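
To make the comparison concrete, the tick positions and labels can be read back from the axes. This is a small sketch, assuming a reasonably recent Matplotlib where string data goes through the categorical unit handling; `fig.canvas.draw()` is called first so the tick labels are populated.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
y = ["5", "25", "10", "1"]  # strings, not ints

fig, ax = plt.subplots()
ax.scatter(x, y)
fig.canvas.draw()  # make sure ticks and labels are up to date

# Numeric positions Matplotlib assigned to each string, and the labels shown there
positions = [int(t) for t in ax.get_yticks()]
labels = [t.get_text() for t in ax.get_yticklabels()]
print(positions)
print(labels)
```

Each string gets the next integer position in order of first appearance, so "25" sits below "10" even though 25 > 10.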

Paritosh Singh
  • Extracting the labels and locations makes it more obvious that the API plots the strings as labels, and that the axis locations are 0-indexed numbers based on how many categories exist (their len).
  • .get_xticks() and .get_yticks() extract a list of the numeric locations.
  • .get_xticklabels() and .get_yticklabels() extract a list of matplotlib.text.Text, Text(x, y, text).
  • There are fewer numbers in the list for the y axis because there were duplicate values as a result of rounding.
  • This applies to any API that uses matplotlib as the backend, such as seaborn or pandas.
    • sns.scatterplot(data=df, x='x_num', y='y', ax=ax1)
    • ax1.scatter(data=df, x='x_num', y='y')
    • ax1.plot('x_num', 'y', 'o', data=df)
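
The point about duplicates can be checked on a tiny example. In this sketch (the data is made up for illustration), repeated strings share one category, so the axis ends up with fewer tick locations than there are data points.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

x = [0, 1, 2, 3]
y = ["3000.1", "2999.8", "3000.1", "2999.8"]  # four points, only two distinct strings

fig, ax = plt.subplots()
ax.scatter(x, y)

# One tick location per distinct string, not per data point
tick_locs = [int(t) for t in ax.get_yticks()]
print(len(y), len(tick_locs))
```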

Labels, Locs, and Text

print(x_nums_loc)
print(y_nums_loc)
print(x_lets_loc)
print(y_lets_loc)
print(x_lets_labels)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]

[Text(0, 0, 'A'), Text(1, 0, 'B'), Text(2, 0, 'C'), Text(3, 0, 'D'), Text(4, 0, 'E'),
 Text(5, 0, 'F'), Text(6, 0, 'G'), Text(7, 0, 'H'), Text(8, 0, 'I'), Text(9, 0, 'J'),
 Text(10, 0, 'K'), Text(11, 0, 'L'), Text(12, 0, 'M'), Text(13, 0, 'N'), Text(14, 0, 'O'),
 Text(15, 0, 'P'), Text(16, 0, 'Q'), Text(17, 0, 'R'), Text(18, 0, 'S'), Text(19, 0, 'T'),
 Text(20, 0, 'U'), Text(21, 0, 'V'), Text(22, 0, 'W'), Text(23, 0, 'X'), Text(24, 0, 'Y'),
 Text(25, 0, 'Z')]

Imports, Data, and Plotting

import string
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# sample data
np.random.seed(45)
x_numbers = np.arange(100, 126)
x_letters = list(string.ascii_uppercase)
y = np.random.normal(loc=3000, scale=1, size=26).round(2)
df = pd.DataFrame.from_dict({'x_num': x_numbers, 'x_let': x_letters, 'y': y}).astype(str)

# plot
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 3.5))
df.plot(kind='scatter', x='x_num', y='y', ax=ax1, title='X Numbers', rot=90)
df.plot(kind='scatter', x='x_let', y='y', ax=ax2, title='X Letters')

x_nums_loc = ax1.get_xticks()
y_nums_loc = ax1.get_yticks()

x_lets_loc = ax2.get_xticks()
y_lets_loc = ax2.get_yticks()

x_lets_labels = ax2.get_xticklabels()

fig.tight_layout()
plt.show()

(image: two side-by-side scatter plots, titled 'X Numbers' and 'X Letters')

Trenton McKinney