As the title suggests, i would like to plot data availability, at any one time for each station. The plot can be thought to be a map or scatter plot, where the station number and time are the coordinates. Which will plot vertical lines, where there is data (i.e. floats/integers), and as a white space if data is missing (ie. NANs), temporal resolution is daily.
Similar to the plot at the end of the post. Which is from the output of an R package, 'Climatol' (homogen function).
I would like to know if there is similar way of plotting in PYTHON, I preferably don't want to use the R package, as it does more than just the plot, and hence will take a lot of hours for thousands of station data.
Some sample data (daily time series) of each stations would be like ;
station1 = pd.DataFrame(pd.np.random.rand(100, 1)).set_index(pd.date_range(start = '2000/01/01', periods = 100))
station2 = pd.DataFrame(pd.np.random.rand(200, 1)).set_index(pd.date_range(start = '2000/03/01', periods = 200))
station3 = pd.DataFrame(pd.np.random.rand(300, 1)).set_index(pd.date_range(start = '2000/06/01', periods = 300))
station4 = pd.DataFrame(pd.np.random.rand(50, 1)).set_index(pd.date_range(start = '2000/09/01', periods = 50))
station5 = pd.DataFrame(pd.np.random.rand(340, 1)).set_index(pd.date_range(start = '2000/01/01', periods = 340))
Real sample data; https://drive.google.com/drive/folders/15PwpWIh13tyOyzFUTiE9LgrxUMm-9gh6?usp=sharing Code to open for two stations;
import pandas as pd
import numpy as np
df1 = pd.read_csv('wgenf - 2019-04-17T012724.318.genform1_proc',skiprows = 8,delimiter = ' ')
df1.drop(df1.tail(6).index,inplace=True)
df1 = df1.iloc[:,[1,3]]
df1.iloc[:,1].replace('-',np.nan,inplace=True)
df1 = df1.dropna()
df1['Date(NZST)'] = pd.to_datetime(df1.iloc[:,0],format = "%Y %m %d")
df1 = df1.set_index('Date(NZST)')
df2 = pd.read_csv('wgenf - 2019-04-17T012830.116.genform1_proc',skiprows = 8,delimiter = ' ')
df2.drop(df2.tail(6).index,inplace=True)
df2 = df2.iloc[:,[1,3]]
df2.iloc[:,1].replace('-',np.nan,inplace=True)
df2 = df2.dropna()
df2['Date(NZST)'] = pd.to_datetime(df2.iloc[:,0],format = "%Y %m %d")
df2 = df2.set_index('Date(NZST)')
Expanding Asmus's code (Answer below) for multiple stations
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import glob as glob
start = '1900/01/01'
end = '2018/12/31'
counter = 0
filenames = glob.glob('data/temperature/*.genform1_proc')
for filename in filenames:
with open(filename, newline='') as f:
### read the csv file with pandas, using the correct tab delimiter
df1 = pd.read_csv(f,skiprows = 8,delimiter = '\t',)
df1.drop(df1.tail(8).index,inplace=True)
### replace invalid '-' with useable np.nan (not a number)
df1.replace('-',np.nan,inplace=True)
df1['Date(NZST)'] = pd.to_datetime(df1['Date(NZST)'],format = "%Y %m %d")
df1 = df1.set_index('Date(NZST)',drop=False)
### To make sure that we have data on all dates:
# create a new index, based on the old range, but daily frequency
idx = pd.date_range(start,end,freq="D")
df1=df1.reindex(idx, fill_value=np.nan)
### Make sure interesting data fields are numeric (i.e. floats)
df1["Tmax(C)"]=pd.to_numeric(df1["Tmax(C)"])
### Create masks for
# valid data: has both date and temperature
valid_mask= df1['Tmax(C)'].notnull()
### decide where to plot the line in y space,
ys=[counter for v in df1['Tmax(C)'][valid_mask].values]
plt.scatter(df1.index[valid_mask].values,ys,s=30,marker="|",color="g")
plt.show()
counter +=1
code above currently plots the one below.