I'm new to data science and I'm practicing with a dataset. This is what I'm trying to do:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

file = pd.read_csv('datasets/office_episodes.csv')

x = np.array(file.loc[:,'episode_number'])

y = np.array(file.loc[:, 'viewership_mil'])

scaled_ratings = np.array(file.loc[:, 'scaled_ratings'])

ratings2 = list(scaled_ratings)
   
plt.title("Popularity, Quality, and Guest Appearances on the Office")

plt.xlabel("Episode Number")

plt.ylabel("Viewership (Millions)")

for i in ratings2:
    if i < 0.25:
        plt.scatter(x, y, c='red')
    elif i >= 0.25 and i < 0.50:
        plt.scatter(x, y, c='orange')
    elif i >= 0.50 and i < 0.75:
        plt.scatter(x, y, c='lightgreen')
    elif i >= 0.75:
        plt.scatter(x, y, c='darkgreen')
    else:
        plt.scatter(x, y, c='pink')


plt.show()

As you can see, the for loop is meant to set the color of each dot in the scatter plot based on the scaled ratings, but when the plot is displayed it looks like this:

image

I also tried creating a variable called ratings3 from ratings2 with a list comprehension, so that I could pass ratings3 to the color parameter of the plt.scatter() function.
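For reference, the list-comprehension idea described above could look roughly like the sketch below. The helper name rating_to_color and the sample ratings are made up for illustration; the thresholds and color names are the ones from the loop in the question.

```python
import math


def rating_to_color(rating):
    """Map one scaled rating to a color name using the question's thresholds."""
    if math.isnan(rating):
        return 'pink'  # mirrors the else branch of the original loop
    if rating < 0.25:
        return 'red'
    elif rating < 0.50:
        return 'orange'
    elif rating < 0.75:
        return 'lightgreen'
    return 'darkgreen'


# Made-up ratings standing in for the 'scaled_ratings' column
ratings2 = [0.10, 0.25, 0.49, 0.50, 0.75, 0.99]

# One color per point, in the same order as the ratings
ratings3 = [rating_to_color(r) for r in ratings2]
print(ratings3)
```

A list like ratings3 can then be passed once as plt.scatter(x, y, c=ratings3), provided it has the same length as x and y, so each point gets its own color instead of every point being redrawn on each loop iteration.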

Henry Ecker
TheMax370

2 Answers

I am not an expert at this, but here is my solution. You would first have to make separate arrays for each category. Then you can plot each with the chosen colors.

ratings = file['scaled_ratings']

# One boolean mask per category. A chained comparison such as
# 0.25 <= ratings < 0.5 raises a ValueError on a pandas Series,
# so the two bounds are combined with & instead.
masks = {
    'red': ratings < 0.25,
    'orange': (ratings >= 0.25) & (ratings < 0.5),
    'lightgreen': (ratings >= 0.5) & (ratings < 0.75),
    'darkgreen': ratings >= 0.75,
}

# Filter x and y with the same mask so both have the same length
for color, mask in masks.items():
    plt.scatter(file.loc[mask, 'episode_number'], file.loc[mask, 'viewership_mil'], c=color)
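
As a rough, self-contained check of the "separate arrays per category" idea, the sketch below uses small made-up arrays in place of the CSV columns (the values and the names masks and groups are assumptions for illustration). The point it shows is that x and y have to be filtered with the same boolean mask, otherwise their lengths no longer match.

```python
import numpy as np

# Made-up values standing in for episode_number, viewership_mil, scaled_ratings
x = np.array([1, 2, 3, 4, 5, 6])
y = np.array([7.2, 8.1, 6.5, 9.0, 5.4, 7.7])
ratings = np.array([0.10, 0.30, 0.60, 0.80, 0.20, 0.55])

# One boolean mask per color category
masks = {
    'red': ratings < 0.25,
    'orange': (ratings >= 0.25) & (ratings < 0.50),
    'lightgreen': (ratings >= 0.50) & (ratings < 0.75),
    'darkgreen': ratings >= 0.75,
}

# Apply the same mask to x and y so each pair stays aligned
groups = {color: (x[mask], y[mask]) for color, mask in masks.items()}

for color, (xs, ys) in groups.items():
    print(color, xs.tolist(), ys.tolist())
```

Each (xs, ys) pair could then be drawn with its own call, for example plt.scatter(xs, ys, c=color, label=color), which also makes it easy to add a legend.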
A.M. Ducu

Some sample data and imports:

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

n = 175
np.random.seed(15)
df = pd.DataFrame({
    'episode_number': np.random.randint(0, 180, n),
    'viewership_mil': np.random.randint(2_500_000, 12_500_000, n) / 1_000_000
})
df['scaled_ratings'] = df['viewership_mil'] / df['viewership_mil'].sum() * 100

df.head():

   episode_number  viewership_mil  scaled_ratings
0             140       12.414172        0.925457
1             133        9.918293        0.739393
2             119        7.513288        0.560104
3             128       11.664907        0.869600
4             156        8.610445        0.641895

Create categories based on scaled_ratings using pd.cut:

colors = pd.cut(
    df['scaled_ratings'],
    bins=[-np.inf, 0.25, .5, .75, np.inf],
    labels=['red', 'orange', 'lightgreen', 'darkgreen'],
    right=False
)

colors.head():

0       darkgreen
1      lightgreen
2      lightgreen
3       darkgreen
4      lightgreen
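
To see how right=False treats values that sit exactly on a bin edge, here is a small sketch with hand-picked ratings (the sample values are made up for illustration; the bins and labels are the ones used above, with -np.inf as the open lower edge):

```python
import numpy as np
import pandas as pd

# Hand-picked ratings, including values exactly on the bin edges
ratings = pd.Series([0.0, 0.25, 0.4999, 0.5, 0.75, 1.0])

colors = pd.cut(
    ratings,
    bins=[-np.inf, 0.25, 0.5, 0.75, np.inf],
    labels=['red', 'orange', 'lightgreen', 'darkgreen'],
    right=False,  # each bin includes its lower edge and excludes its upper edge
)

# Convert the categorical result to plain strings for inspection
color_list = colors.astype(str).tolist()
print(color_list)
```

With right=False, 0.25 falls in the 'orange' bin and 0.5 in the 'lightgreen' bin, which matches the >= / < conditions in the question.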

Then plot scatter and specify c=:

fig, ax = plt.subplots()
ax.scatter(x=df['episode_number'], y=df['viewership_mil'], c=colors)
plt.title("Popularity, Quality, and Guest Appearances on the Office")
plt.xlabel("Episode Number")
plt.ylabel("Viewership (Millions)")
plt.show()

plot 1


All together:

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

n = 175
np.random.seed(15)
df = pd.DataFrame({
    'episode_number': np.random.randint(0, 180, n),
    'viewership_mil': np.random.randint(2_500_000, 12_500_000, n) / 1_000_000
})
df['scaled_ratings'] = df['viewership_mil'] / df['viewership_mil'].sum() * 100

# Assign Colors based on df['scaled_ratings']
colors = pd.cut(
    df['scaled_ratings'],
    bins=[-np.inf, 0.25, .5, .75, np.inf],
    labels=['red', 'orange', 'lightgreen', 'darkgreen'],
    right=False  # Lower-bound inclusive x >= .25 and x < .5
)
# Plot
fig, ax = plt.subplots()
ax.scatter(x=df['episode_number'], y=df['viewership_mil'], c=colors)
plt.title("Popularity, Quality, and Guest Appearances on the Office")
plt.xlabel("Episode Number")
plt.ylabel("Viewership (Millions)")
plt.show()
Henry Ecker