
I am loading a csv file into a pandas dataframe. I would like to plot histograms of the resulting data.

Some of my columns are dates. Pandas uses the data type datetime64[ns] to store them. For my dates, I would like to put correct date formatted x-tick marks on the x-axis.

Here is some code that does not work:

import pandas
import numpy as np
import os
from datetime import datetime
from matplotlib import pyplot as plt

dirname='/my_working_dir/'
in_filename=os.path.join(dirname,'input_data.csv')
df = pandas.read_csv(in_filename,parse_dates=['Date of event'],dayfirst=True)

failures=df[df['Failure']==True];
suspensions=df[df['Failure']==False];

f=failures['Date of event'].dropna()
s=suspensions['Date of event'].dropna()

fig, ax = plt.subplots()
ax.hist([f,s],40,weights=[np.zeros_like(f) + 1. / f.size,
                         np.zeros_like(s) + 1. / s.size],
        color=['r','g']);
ax.set_yticklabels(['{:.0f}%'.format(x*100) 
                           for x in plt.gca().get_yticks()])
numbers=ax.get_xticks();
labels=map(lambda x: datetime.fromtimestamp(x).strftime('%Y-%m-%d'), numbers)
plt.xticks(numbers, labels)

Error:

Traceback (most recent call last):
   File "datetest.py", line 22, in <module>
    ax.hist([f,s],40,weights=[np.zeros_like(f) + 1. / f.size,
TypeError: ufunc add cannot use operands with types dtype('<M8[ns]') and dtype('float64')

I know that this is quite a bit of code, but the issue is with integrating the whole thing, and I am willing to change any piece (reading in the data, or plotting, or setting the xlabels) to get it to work.

Things I have tried:

  • making an integer version of the date data using df['int_date']=df['Date of event'].view('int64'). This lets me plot the histogram I need. The range of x is 1e18 to 1.5e18, and I can't figure out how to get proper date-formatted xticks.
  • converting to a timestamp using df['test']=((df['Date of event'] - np.datetime64('1970-01-01T00:00:00Z')) / np.timedelta64(1, 's')) (as suggested in another Stack Overflow post). This fails with: "TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''". My numpy is version 1.10.4, and I can't install new libraries or upgrade anything on my system.

Here is some simplified content of the csv file (my real data is much larger):

Index,Date of event,Failure
12421,18/11/2016,TRUE
12409,01/05/2017,FALSE
12410,29/03/2017,FALSE
12453,21/08/2016,TRUE
12454,01/08/2016,TRUE

The answer in "How can I convert pandas date time xticks to readable format?" doesn't solve my problem: I can't even get to the point of having a plot with my data still in datetime64 format. In that question, the xticks already worked and only needed reformatting.

Thank you for any help you can provide.

moink

1 Answer


You have two problems.

The first is in the weights list. np.zeros_like(f) does not give anything useful here: f is a Series rather than a numpy array, and it holds dates, for which "zero" has no obvious meaning. The result keeps the datetime64 dtype, which is why adding a float to it raises the TypeError.
What you really want is a numpy array of zeros with the same length as f. You can get one via np.zeros(len(f)) or np.zeros(f.size).

Second, you cannot pass the Series directly; you need to take its values: ax.hist([f.values, s.values]) instead of ax.hist([f, s]).

So in total:

weights = [np.zeros(len(f)) + 1. / f.size,  np.zeros(len(s)) + 1. / s.size]
ax.hist([f.values, s.values],40,weights=weights, color=['r','g'])
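As a rough check of the weights above, here is a minimal sketch using a few dates taken from the question's sample data. The helper name fraction_weights is made up for illustration; np.full is used as a shorthand for np.zeros(n) + 1. / n. It shows that zeros_like inherits the date dtype, and that the corrected weights for each dataset sum to 1, so bar heights read as fractions.

```python
import numpy as np

# A few dates from the question's sample rows, split into two groups.
f = np.array(["2016-11-18", "2016-08-21", "2016-08-01"], dtype="datetime64[ns]")
s = np.array(["2017-05-01", "2017-03-29"], dtype="datetime64[ns]")

# zeros_like copies the datetime64 dtype, so "+ 1. / f.size" cannot work on it.
zeros_dtype = np.zeros_like(f).dtype

def fraction_weights(values):
    """Hypothetical helper: one float weight per value, summing to 1."""
    n = len(values)
    return np.full(n, 1.0 / n)

weights = [fraction_weights(f), fraction_weights(s)]
```

Because each group is normalized separately, groups of different sizes can be compared on the same percentage axis.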

At this point you may consider formatting the x axis. However, this will lead to new errors, so I would suggest leaving that out and, if needed, sticking to a solution similar to the one presented in the question "How can I convert pandas date time xticks to readable format?"
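If date labels are needed anyway, one possible workaround (not part of the original suggestion, and not verified on matplotlib 1.5.1) is to convert the dates to matplotlib's own float date numbers before plotting and attach a date formatter to the axis. The sketch below assumes the five sample rows from the question; the helper name to_mpl_dates is hypothetical, and the Agg backend is selected only so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# The five sample rows from the question.
df = pd.DataFrame({
    "Date of event": pd.to_datetime(
        ["18/11/2016", "01/05/2017", "29/03/2017", "21/08/2016", "01/08/2016"],
        format="%d/%m/%Y"),
    "Failure": [True, False, False, True, True],
})

def to_mpl_dates(series):
    """Hypothetical helper: datetime64 Series -> matplotlib float date numbers."""
    return mdates.date2num([ts.to_pydatetime() for ts in series.dropna()])

f = to_mpl_dates(df.loc[df["Failure"], "Date of event"])
s = to_mpl_dates(df.loc[~df["Failure"], "Date of event"])

# One weight per value so each dataset's bars sum to 1.
weights = [np.full(len(f), 1.0 / len(f)), np.full(len(s), 1.0 / len(s))]

fig, ax = plt.subplots()
ax.hist([f, s], bins=4, weights=weights, color=["r", "g"])

# Tell the axis that its floats are dates and how to print them.
ax.xaxis.set_major_locator(mdates.AutoDateLocator())
ax.xaxis.set_major_formatter(mdates.DateFormatter("%Y-%m-%d"))
fig.autofmt_xdate()
```

Since the histogram only ever sees plain floats, this avoids passing datetime64 arrays to ax.hist at all; call fig.savefig(...) or plt.show() afterwards to view the result.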

A complete example:

import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import numpy as np

dates = pd.date_range("2013-01-01", "2017-06-20" )
y = np.cumsum(np.random.normal(size=len(dates)))
fail = np.random.choice([True, False], size=len(dates))

df = pd.DataFrame({'Date of event':dates, "y":y, 'Failure':fail})

failures=df[df['Failure']==True]
suspensions=df[df['Failure']==False]

f=failures['Date of event'].dropna()
s=suspensions['Date of event'].dropna()

fig, ax = plt.subplots()

weights=[np.zeros(len(f)) + 1. / f.size,  np.zeros(len(s)) + 1. / s.size]
ax.hist([f.values, s.values],40,weights=weights,
        color=['r','g'])


ax.set_yticklabels(['{:.1f}%'.format(x*100)
                           for x in ax.get_yticks()])
fig.autofmt_xdate()
plt.show()


ImportanceOfBeingErnest
  • Thanks. Unfortunately it didn't work for me. I got errors similar to those from some of my other debugging attempts: `Traceback (most recent call last): File "datetest2.py", line 25, in <module> color=['r','g']) File "/var2/opt/anaconda2/lib/python2.7/site-packages/matplotlib/__init__.py", line 1812, in inner return func(ax, *args, **kwargs) File "/var2/opt/anaconda2/lib/python2.7/site-packages/matplotlib/axes/_axes.py", line 5983, in hist raise ValueError("color kwarg must have one color per dataset") ValueError: color kwarg must have one color per dataset` – moink Jun 20 '17 at 14:18
  • And when I removed the color argument: `Traceback (most recent call last): File "datetest2.py", line 24, in <module> ax.hist([f.values, s.values],40,weights=weights) File "/var2/opt/anaconda2/lib/python2.7/site-packages/matplotlib/__init__.py", line 1812, in inner return func(ax, *args, **kwargs) File "/var2/opt/anaconda2/lib/python2.7/site-packages/matplotlib/axes/_axes.py", line 5996, in hist xmin = min(xmin, xi.min())` (continued in next comment) – moink Jun 20 '17 at 14:21
  • `File "/var2/opt/anaconda2/lib/python2.7/site-packages/numpy/core/_methods.py", line 29, in _amin return umr_minimum(a, axis, None, out, keepdims) ValueError: operands could not be broadcast together with shapes (789,) (843,)` – moink Jun 20 '17 at 14:21
  • Which version of matplotlib are you using? – ImportanceOfBeingErnest Jun 20 '17 at 14:30
  • My version of matplotlib is 1.5.1. Unfortunately I don't have the option of upgrading. – moink Jun 20 '17 at 14:36
  • I do have matplotlib 2.0.2 and unfortunately I'm not willing to downgrade just for testing. Are you able to produce a normal (using numbers, not dates) histogram with unequal sized arrays and different weights and colors? – ImportanceOfBeingErnest Jun 20 '17 at 14:46
  • If I change the histogram line to `ax.hist([f.values.tolist(), s.values.tolist()],40,weights=weights,color=['r','g'])` I get a histogram, but then the date formatting stops working, and I get x values in the range 1.48e18 to 1.5e18 – moink Jun 20 '17 at 14:50
  • Yes, I have very similar code working nicely for all my other (non-date) columns. – moink Jun 20 '17 at 14:52
  • Yep ok, so I guess you have to update matplotlib or format the axes manually. – ImportanceOfBeingErnest Jun 20 '17 at 14:54
  • Yes, that's why my original question was about formatting the axes. But I can't seem to get that to work either. – moink Jun 20 '17 at 14:57
  • With your help and with the following additional code I got it to work: `numbers=ax.get_xticks(); labels=([datetime.fromtimestamp(number/1e9).strftime('%Y-%m-%d') for number in numbers]); plt.xticks(numbers, labels, rotation=45);` – moink Jun 20 '17 at 15:19
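The tick conversion from the last comment can be sketched on its own, assuming the x values are nanoseconds since 1970-01-01 (which is what datetime64[ns] stores). The names EPOCH and ns_to_date_label are made up for illustration. Unlike datetime.fromtimestamp, which converts to the machine's local time zone, this variant adds the offset to a fixed epoch, so the labels do not depend on where the code runs.

```python
from datetime import datetime, timedelta

# datetime64[ns] values count nanoseconds from this instant (UTC).
EPOCH = datetime(1970, 1, 1)

def ns_to_date_label(ns, fmt="%Y-%m-%d"):
    """Hypothetical helper: format a nanosecond tick value as a date string."""
    # timedelta has microsecond resolution, so drop the last three digits.
    return (EPOCH + timedelta(microseconds=int(ns) // 1000)).strftime(fmt)

# Two dates from the question's sample data, expressed as nanoseconds.
tick_values = [1479427200000000000, 1493596800000000000]
labels = [ns_to_date_label(v) for v in tick_values]
# labels == ['2016-11-18', '2017-05-01']
```

With a histogram of integer dates already drawn, the labels would be applied the same way as in the comment: `numbers = ax.get_xticks()` followed by `plt.xticks(numbers, [ns_to_date_label(n) for n in numbers], rotation=45)`.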