I need some guidance to plot:

  1. a scatter plot of the df1 data: time vs. y, using column z for the hue
  2. a line plot of the df2 data: time vs. y
  3. a single horizontal line at y = c (c is a constant)

The y data in df1 and df2 are different, but they are in the same range.

I do not know where to begin. Any guidance is appreciated.

More explanation. A portion of data is presented here. I want to plot:

  1. a scatter plot of time vs. CO2
  2. the yearly rolling average of CO2, from 01/01/2016 to 09/30/2019, based on hourly data. The first average covers "01/01/2016 00" to "12/31/2016 23", the second covers "01/01/2016 01" to "01/01/2017 00", and so on (like the trend in the plot below)
  3. the maximum of all the data, drawn as a horizontal line across the plot (like the straight line below)

enter image description here

Sample data

data = {'Date':['0     01/14/2016 00', '01/14/2016 01','01/14/2016 02','01/14/2016 03','01/14/2016 04','01/14/2016 05','01/14/2016 06','01/14/2016 07','01/14/2016 08','01/14/2016 09','01/14/2016 10','01/14/2016 11','01/14/2016 12','01/14/2016 13','01/14/2016 14','01/14/2016 15','01/14/2016 16','01/14/2016 17','01/14/2016 18','01/14/2016 19'],
        'CO2':[2415.9,2416.5,2429.8,2421.5,2422.2,2428.3,2389.1,2343.2,2444.,2424.8,2429.6,2414.7,2434.9,2420.6,2420.5,2397.1,2415.6,2417.4,2373.2,2367.9],
        'Year':[2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016]} 

# Create DataFrame 
df = pd.DataFrame(data)

# DataFrame view
                Date     CO2  Year
 0     01/14/2016 00  2415.9  2016
       01/14/2016 01  2416.5  2016
       01/14/2016 02  2429.8  2016
       01/14/2016 03  2421.5  2016
       01/14/2016 04  2422.2  2016
Trenton McKinney
  • Some research in the matplotlib docs will point the way: https://matplotlib.org/3.1.1/gallery/text_labels_and_annotations/legend_demo.html#sphx-glr-gallery-text-labels-and-annotations-legend-demo-py – Florian Bernard Oct 31 '19 at 17:32
  • Please [provide a reproducible copy of the DataFrame with `to_clipboard`](https://stackoverflow.com/questions/52413246/provide-a-reproducible-copy-of-the-dataframe-with-to-clipboard/52413247#52413247) – Trenton McKinney Oct 31 '19 at 17:38
  • Pass an axis object to the plot command, e.g. `df.plot(ax=ax); df2.plot.scatter(ax=ax)`... – Quang Hoang Oct 31 '19 at 18:22

2 Answers

You can use a dual-axis chart. Because the y data in both DataFrames cover the same range, the two y-axes should end up with nearly the same scale, so the result should look like your example. You can plot directly from the pandas DataFrames:

import matplotlib.pyplot as plt
import pandas as pd

# create a color map for the z column
color_map = {'z_val1':'red', 'z_val2':'blue', 'z_val3':'green', 'z_val4':'yellow'}

fig = plt.figure()
ax1 = fig.add_subplot(111)
ax2 = ax1.twinx() #second axis within the first

# define scatter plot
df1.plot.scatter(x = 'date',
                 y = 'CO2',
                 ax = ax1,
                 c = df1['z'].map(color_map))  # map each z value in df1 to its color

# define line plot
df2.plot.line(x = 'date',
         y = 'MA_CO2', #moving average in dataframe 2
         ax = ax2)


# plot the horizontal line at y = c (constant value)
ax1.axhline(y = c, color='r', linestyle='-')

# to fit the chart properly
plt.tight_layout()
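
As an alternative to twin axes, all three layers can be drawn on one shared Axes, which guarantees a single y scale. The following is only a minimal sketch: df1 and df2 are synthetic stand-ins, the column names time, y, and z are assumptions, and plot_overlay is a hypothetical helper, not part of pandas or matplotlib.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


def plot_overlay(df1, df2, c):
    """Draw a scatter colored by z, a line, and a horizontal line on one Axes."""
    fig, ax = plt.subplots(figsize=(8, 4))
    # one scatter call per z value, so each group gets its own color and legend entry
    for z_value, group in df1.groupby("z"):
        ax.scatter(group["time"], group["y"], label=f"z = {z_value}")
    # line plot of the second DataFrame on the same Axes
    ax.plot(df2["time"], df2["y"], color="black", label="df2")
    # horizontal line at the constant c
    ax.axhline(y=c, color="red", linestyle="--", label=f"y = {c}")
    ax.legend()
    fig.autofmt_xdate()
    return fig, ax


# synthetic stand-ins for df1 and df2 (assumed column names: time, y, z)
rng = np.random.default_rng(0)
time = pd.date_range("2016-01-01", periods=30, freq="D")
df1 = pd.DataFrame({"time": time,
                    "y": rng.uniform(0, 10, len(time)),
                    "z": rng.choice(["a", "b", "c"], size=len(time))})
df2 = pd.DataFrame({"time": time, "y": rng.uniform(0, 10, len(time))})

fig, ax = plot_overlay(df1, df2, c=5.0)
```

Call plt.show() afterwards to display the figure in a script.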
mank

Using matplotlib.pyplot:

  • Use plt.hlines to add a horizontal line at a constant y value
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# with synthetic data
np.random.seed(365)
data = {'CO2': [np.random.randint(2000, 2500) for _ in range(783)],
        'Date': pd.bdate_range(start='1/1/2016', end='1/1/2019').tolist()}

# create the dataframe:
df = pd.DataFrame(data)

# verify Date is in datetime format
df['Date'] = pd.to_datetime(df['Date'])

# set Date as index so .rolling can be used
df.set_index('Date', inplace=True)

# add rolling mean
df['rolling'] = df['CO2'].rolling('365D').mean()

# plot the data
plt.figure(figsize=(8, 8))
plt.scatter(x=df.index, y='CO2', data=df, label='data')
plt.plot(df.index, 'rolling', data=df, color='black', label='365 day rolling mean')
plt.hlines(max(df['CO2']), xmin=min(df.index), xmax=max(df.index), color='red', linestyles='dashed', label='Max')
plt.hlines(np.mean(df['CO2']), xmin=min(df.index), xmax=max(df.index), color='green', linestyles='dashed', label='Mean')
plt.xticks(rotation=45)  # rotation should be a number, not the string '45'
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))
plt.show()

Plot using synthetic data:

enter image description here
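
A time-based window such as rolling('365D') selects rows by timestamp rather than by row count, so it can cope with uneven sampling and missing days once the index is a real DatetimeIndex. A small sketch with made-up readings and a shorter '2D' window (both are assumptions chosen for illustration):

```python
import pandas as pd

# hypothetical irregular readings: uneven counts per day and one missing day
idx = pd.to_datetime([
    "2016-01-01 00:00", "2016-01-01 06:00", "2016-01-01 12:00",
    "2016-01-02 00:00",
    "2016-01-04 00:00", "2016-01-04 12:00",  # 2016-01-03 is missing
])
s = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0, 60.0], index=idx)

# each row averages every reading in the 2 days up to and including its timestamp
rolled = s.rolling("2D").mean()

# number of readings that fell inside each window; it varies from row to row
counts = s.rolling("2D").count()

print(pd.DataFrame({"value": s, "mean_2D": rolled, "n_in_window": counts}))
```

The window length is fixed in time, so rows after the gap are averaged over fewer readings than rows in the densely sampled stretch.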

Issues with the Date format in the OP's data:

  • Use a regular expression to fix the Date column
  • Place the code that fixes Date just before df['Date'] = pd.to_datetime(df['Date'])
import re

# your data
                Date     CO2  Year
 0     01/14/2016 00  2415.9  2016
       01/14/2016 01  2416.5  2016
       01/14/2016 02  2429.8  2016
       01/14/2016 03  2421.5  2016
       01/14/2016 04  2422.2  2016

df['Date'] = df['Date'].apply(lambda x: (re.findall(r'\d{2}/\d{2}/\d{4}', x)[0]))

# fixed Date column
       Date     CO2  Year
 01/14/2016  2415.9  2016
 01/14/2016  2416.5  2016
 01/14/2016  2429.8  2016
 01/14/2016  2421.5  2016
 01/14/2016  2422.2  2016
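
Note that the pattern above keeps only the date and drops the trailing two digits. If those digits are the hour of the reading (an assumption based on the question's mention of hourly data), a variant can keep them and parse with an explicit format. A minimal sketch using a few of the sample strings:

```python
import pandas as pd

# sample strings from the question; the first entry carries a stray index prefix
raw = pd.Series(["0     01/14/2016 00", "01/14/2016 01", "01/14/2016 02"])

# extract 'MM/DD/YYYY HH' (date plus a two-digit hour), dropping any prefix
stamp = raw.str.extract(r"(\d{2}/\d{2}/\d{4} \d{2})", expand=False)

# parse with an explicit format so the hour is kept
parsed = pd.to_datetime(stamp, format="%m/%d/%Y %H")
print(parsed)
```

With hourly timestamps as the index, a '365D' rolling window would then advance reading by reading instead of day by day.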
Trenton McKinney
  • This is wonderful! Does .rolling('365D').mean() work for my data as well (0 01/14/2016 00, 01/14/2016 01, 01/14/2016 02, ...)? As you can see, some data are missing: one day may have 24 data points and another only 10, and some days may be missing entirely. How does the window move over the data? – Pouyan Ebrahimi Oct 31 '19 at 19:20
  • @PouyanEbrahimi No, you have to fix your date to be an actual datetime format. What are you importing the data from (csv, txt)? Please paste the top 5 rows of data into your question, as they appear in the file. It doesn't seem like the data is being imported correctly, based on the `Date` data. – Trenton McKinney Oct 31 '19 at 19:27
  • @PouyanEbrahimi I've added a line of code at the bottom to fix your `Date` column – Trenton McKinney Oct 31 '19 at 19:40