
I'm going to work on a machine learning project with lots of data soon, so I tried to emulate the type of challenge I will face in order to prepare accordingly. The first one is plotting datetimes (date on the x-axis and hour on the y-axis) from a .csv file like the following (with this type of data from April 2021 and April 2020):

following file

What I have now is the following plot from the 2021 set:

plot

But I don't see why the dots don't line up with the grid correctly. For instance, my latest point in April 2021 should be the last date I have in the file (which is 07:37:56 30/04/2021):

which is 07:37:56 30/04/2021

But as you can see on the plot below:

plot

The last dot in April is closer to 07:00:00 than to 08:00:00 on the y ticks, and it sits between two x ticks, as if it were halfway between day 29 and day 30. This is just one example of what's happening all over the plot, and I don't know what causes it.

My code on Google Colab looks like this:

#LIBRARIES AND PRE PROCESSING
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as pltd
import numpy as np
from datetime import datetime

caminho_dados = "/content/sample.csv"
df = pd.read_csv(caminho_dados,encoding='UTF-8',sep=',')

novo=df['ID;Hora;Data;;;'].str.split(';').str
df['ID']=novo[0]
df['Hora']=novo[1]
df['Data']=novo[2]

df.drop(columns=['ID;Hora;Data;;;'],inplace=True)

df['Data']=pd.to_datetime(df['Data'],format ='%d/%m/%Y')
df['Hora']=pd.to_datetime(df['Hora'],format ='%H:%M:%S')

#SEPARATING THE TWO APRILS I HAVE ON TWO DATAFRAMES, 2021 ONE & 2020 ONE
df_vinteum = df.iloc[:50, :]
df_vinte = df.iloc[50:, :]

#PLOTTING
fig, ax = plt.subplots()
ax.scatter(df_vinteum['Data'], df_vinteum['Hora'])

ax.xaxis.set_major_formatter(pltd.DateFormatter('%d'))
ax.yaxis.set_major_formatter(pltd.DateFormatter('%H'))

start, end = ax.get_xlim()
ax.xaxis.set_ticks(np.arange(start, end, 1))

start, end = ax.get_ylim()
ax.yaxis.set_ticks(np.arange(start,end,(end-start)/24))

plt.title("Abril 2021")
plt.xlabel('Dias')
plt.ylabel('Horas')
plt.xticks(rotation=45)
plt.grid()

plt.rcParams["figure.figsize"] = (20,6)

plt.show()
alpakyol

1 Answer


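This happens because your ticks are built from the default axis limits that matplotlib generates, and those limits don't necessarily fall on the exact start of a day or an hour. Every tick produced by np.arange(start, end, ...) inherits the offset of start, and on the y-axis the step (end-start)/24 is one twenty-fourth of the visible range rather than one hour. I created a smaller sample data set to recreate the same issue: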

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as pltd

## create sample data
hora =  ['12:06:24','20:49:00','07:37:56']
dias = ['14/04/2021','15/04/2021','30/04/2021']

df = pd.DataFrame({'Hora':hora, 'Data':dias})

Notice that if I modify your code to show minutes on the y-axis, using ax.yaxis.set_major_formatter(pltd.DateFormatter('%H:%M')), you can see that the actual y-ticks are not exactly on the hour:

[Image: the plot with y-axis tick labels formatted as %H:%M]
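The offset can also be reproduced without drawing anything. The following is a minimal sketch, not taken from the original code: the limits start and end are hypothetical values chosen for illustration (matplotlib pads the data range, so a real lower limit rarely falls on midnight), and the tick positions are built with the same np.arange call as in the question.

```python
from datetime import datetime

import matplotlib.dates as pltd
import numpy as np

# Hypothetical axis limits (an assumption for illustration): the lower
# limit is 06:00 on 13 April rather than midnight.
start = pltd.date2num(datetime(2021, 4, 13, 6, 0))
end = pltd.date2num(datetime(2021, 4, 16, 6, 0))

# One tick per day, as in the question. Each tick keeps the 06:00 offset
# of `start`, so none of them falls on a day boundary.
ticks = np.arange(start, end, 1)
tick_times = [pltd.num2date(t).strftime('%d %H:%M') for t in ticks]
print(tick_times)  # ['13 06:00', '14 06:00', '15 06:00']
```

With the '%d' formatter these ticks would be labelled 13, 14 and 15 even though each one sits a quarter of the way into its day, which is why points appear shifted relative to the grid.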

The same issue occurs on the x-axis. To fix both, you can use DayLocator and HourLocator from matplotlib.dates (described in the matplotlib documentation, with a further explanation in a related answer).

If you pass DayLocator and HourLocator objects to ax.xaxis.set_major_locator and ax.yaxis.set_major_locator respectively, you can correct the scatterplot tick marks (full code below):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as pltd

## create sample data
hora =  ['12:06:24','20:49:00','07:37:56']
dias = ['14/04/2021','15/04/2021','30/04/2021']

df = pd.DataFrame({'Hora':hora, 'Data':dias})

df['Data']=pd.to_datetime(df['Data'],format ='%d/%m/%Y')
df['Hora']=pd.to_datetime(df['Hora'],format ='%H:%M:%S')

#PLOTTING
fig, ax = plt.subplots()
ax.scatter(df['Data'], df['Hora'])

ax.xaxis.set_major_formatter(pltd.DateFormatter('%d'))
ax.yaxis.set_major_formatter(pltd.DateFormatter('%H:%M'))

days = pltd.DayLocator(interval = 1)
ax.xaxis.set_major_locator(days)

hours = pltd.HourLocator(interval = 1)
ax.yaxis.set_major_locator(hours)

plt.title("Abril 2021")
plt.xlabel('Dias')
plt.ylabel('Horas')
plt.xticks(rotation=45)
plt.grid()

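# Note: rcParams only affects figures created after this line, not the one above;
# pass figsize=(20, 6) to plt.subplots() to size this figure directly
plt.rcParams["figure.figsize"] = (20,6)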

plt.show()

[Image: the corrected plot, with ticks on whole days and whole hours]
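One way to confirm that the locators put the ticks where expected is to read the tick positions back and convert them to datetimes. This is a sketch under a few assumptions: it uses the same three-row sample data, selects the non-interactive Agg backend so it runs without a display, and the names on_the_hour and at_midnight are made up for this check.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.dates as pltd
import matplotlib.pyplot as plt
import pandas as pd

# Same sample data as above
df = pd.DataFrame({'Hora': ['12:06:24', '20:49:00', '07:37:56'],
                   'Data': ['14/04/2021', '15/04/2021', '30/04/2021']})
df['Data'] = pd.to_datetime(df['Data'], format='%d/%m/%Y')
df['Hora'] = pd.to_datetime(df['Hora'], format='%H:%M:%S')

fig, ax = plt.subplots()
ax.scatter(df['Data'], df['Hora'])
ax.xaxis.set_major_locator(pltd.DayLocator(interval=1))
ax.yaxis.set_major_locator(pltd.HourLocator(interval=1))

# Convert the tick positions (matplotlib date numbers) back to datetimes
y_ticks = [pltd.num2date(t) for t in ax.get_yticks()]
x_ticks = [pltd.num2date(t) for t in ax.get_xticks()]

# Hypothetical check names: every y tick on a whole hour, every x tick at midnight
on_the_hour = all(t.minute == 0 and t.second == 0 for t in y_ticks)
at_midnight = all(t.hour == 0 and t.minute == 0 for t in x_ticks)
print(on_the_hour, at_midnight)
```

If both values print as True, the grid lines coincide with hour and day boundaries, so a point at 07:37:56 should sit a little more than halfway between the 07:00 and 08:00 lines.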

Derek O