
Before asking this question, I spent a day yesterday looking for an answer in previous Stack Overflow answers as well as the Internet, but I couldn't find the solution to my problem.

I have a data frame for oil production in the US over time. The data includes the date column and corresponding values. The minimum reproducible code for the data is below:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('https://raw.githubusercontent.com/Arsik36/SO/master/Oil%20production.csv',
                 parse_dates=['date'],
                 index_col='date')

I use the below code to visualize a general trend in oil production over time:

# Visualizing Time Series
df.value.plot(title = 'Oil production over time')

# Specifying naming convention for x-axis
plt.xlabel('Date')

# Specifying naming convention for y-axis
plt.ylabel('Oil production volume')

# Improving visual aesthetics
plt.tight_layout()

# Showing the result
plt.show()

If you run this code in your environment, you will see that the plot shows how the values change over time. What I struggle with is either separating the plot into subplots by year (for example, 1995 to 1997) or showing a separate line for each year on one graph:

df['1995' : '1997'].value.plot(title = 'Oil production over time', subplots = True)

When I use this code, it correctly subsets my data to the years 1995 through 1997. With subplots = True, the graph is indeed separated by year, but only on the x-axis: if you run it in your environment, you can see that a single line shows the results for all 3 years. What I am trying to do is either to separate the plot into 3 subplots for the years 1995, 1996, and 1997, or to show 3 lines in one plot, each line corresponding to a unique year.

It is important to me to be able to do this by keeping the date column as the index column without creating any additional columns (if possible) to solve this problem.

Thank you in advance for your help.

Arsik36

1 Answer


You're right that there's no built-in solution for this in Python; I know that R has an implementation for it in fpp2.

The solution I've come up with is to get the data from each year from your data and plot it consecutively in a for loop.

years = [1995, 1996, 1997]

fig, ax = plt.subplots(figsize=(10, 6))

for year in years:
    # Slice the data for one year and drop its dates, so that every year
    # can be placed on the same time frame
    aux = df[df.index.year == year].reset_index(drop=True)

    # Afterwards, give every year's series the same placeholder dates
    aux.index = pd.date_range('2000-01-01', periods=len(aux), freq='W')

    # Plot against the shared dates, one labelled line per year
    ax.plot(aux.index, aux['value'], label=str(year))

ax.legend()
fig.autofmt_xdate()  # rotate the date labels so they can be read clearly

plt.show()

This yields a result like this:

[Image: resulting plot]

The only thing left to do would be to format the x axis labels so only the months are displayed.
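One possible way to get month-only labels is sketched below, using matplotlib's MonthLocator and DateFormatter. This is a variant rather than the exact loop above: instead of a weekly placeholder range, it moves each observation's real month and day into a common year, which keeps the date index and adds no columns. The function name plot_years_overlaid, its column parameter, and the synthetic demo data are assumptions made for illustration; with the real data you would pass df instead of demo.

```python
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def plot_years_overlaid(df, years, column="value"):
    """Draw one line per year on a shared January-to-December axis."""
    fig, ax = plt.subplots(figsize=(10, 6))
    for year in years:
        one_year = df.loc[df.index.year == year, column]
        # Move every observation into the same placeholder year (2000 is a
        # leap year, so 29 February stays valid) while keeping month and day.
        common_dates = pd.to_datetime(one_year.index.strftime("2000-%m-%d"))
        ax.plot(common_dates, one_year.to_numpy(), label=str(year))
    # One tick per month, labelled with the abbreviated month name only
    ax.xaxis.set_major_locator(mdates.MonthLocator())
    ax.xaxis.set_major_formatter(mdates.DateFormatter("%b"))
    ax.legend()
    return fig, ax


# Synthetic weekly data standing in for the CSV (an assumption for illustration)
idx = pd.date_range("1995-01-01", "1997-12-31", freq="W")
demo = pd.DataFrame({"value": np.linspace(6500, 6400, len(idx))}, index=idx)
demo.index.name = "date"

fig, ax = plot_years_overlaid(demo, [1995, 1996, 1997])
```

Call plt.show() (or fig.savefig with a file name) afterwards to view the figure. The placeholder year never appears on the axis, because the formatter prints only the month.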

Dharman
  • Your solution works, thank you! One question - why did you specify '01-01-2000' in the pd.to_datetime() method? – Arsik36 Aug 15 '20 at 18:44
  • Because you needed a common date for all dataframes in the loop, but it's irrelevant what date you add as long as the first day of the year is the start. If you use different dates for each year you won't be able to plot them. – Ignacio Valenzuela Aug 15 '20 at 18:54
  • If this answer has resolved your issue I'd appreciate you accepting it as the solution. – Ignacio Valenzuela Aug 15 '20 at 18:55
  • It partially solves my problem, thank you. I will now try to work out a way to make x-axis output correct labels. – Arsik36 Aug 15 '20 at 19:07
  • To format x axis, you might find [Pandas bar plot changes date format](https://stackoverflow.com/q/30133280/10315163) useful. – Ynjxsjmh Aug 16 '20 at 01:41