
I'm looking for a way to convert dates given in the format YYYYmmdd to an np.array with dtype='datetime64'. The dates are stored in another np.array but with dtype='float64'.

I am looking for a way to achieve this by avoiding Pandas!

I already tried something similar to what is suggested in this answer, but the author states that "[...] if (the date format) was in ISO 8601 you could parse it directly using numpy, [...]".

As the date format in my case is YYYYmmdd, which IS(?) ISO 8601, it should somehow be possible to parse it directly using numpy. But I don't know how, as I am a total beginner in Python and coding in general.

I would really like to avoid Pandas because I don't want to bloat my script when the task can be done with the modules I am already using. I also read that it would decrease the speed here.

zorrolo

2 Answers


If no one else comes up with something more built-in, here is a pedestrian method:

>>> dates
array([19700101., 19700102., 19700103., 19700104., 19700105., 19700106.,
       19700107., 19700108., 19700109., 19700110., 19700111., 19700112.,
       19700113., 19700114.])
>>> y, m, d = dates.astype(int) // np.c_[[10000, 100, 1]] % np.c_[[10000, 100, 100]]
>>> y.astype('U4').astype('M8') + (m-1).astype('m8[M]') + (d-1).astype('m8[D]')
array(['1970-01-01', '1970-01-02', '1970-01-03', '1970-01-04',
       '1970-01-05', '1970-01-06', '1970-01-07', '1970-01-08',
       '1970-01-09', '1970-01-10', '1970-01-11', '1970-01-12',
       '1970-01-13', '1970-01-14'], dtype='datetime64[D]')
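
The one-liner above can also be unpacked into separate steps, which may be easier to follow. This is only a sketch of the same idea: the sample values are made up, and the explicit units (`M8[Y]`, `np.int64`) are my own choice rather than what the answer uses.

```python
import numpy as np

# Hypothetical sample input: YYYYmmdd dates stored as float64.
dates = np.array([19700101., 19991231., 20190331.])

as_int = dates.astype(np.int64)      # 19700101.0 -> 19700101

# Peel off the digit groups with floor division and modulo.
y = as_int // 10000                  # first four digits: the year
m = as_int // 100 % 100              # middle two digits: the month
d = as_int % 100                     # last two digits: the day

# Start at January 1st of each year, then add the month and day offsets.
result = (
    y.astype('U4').astype('M8[Y]')   # year as a string, parsed to datetime64[Y]
    + (m - 1).astype('m8[M]')        # whole months after January
    + (d - 1).astype('m8[D]')        # whole days after the 1st
)
print(result)
```

The sum ends up with dtype `datetime64[D]`, because adding a timedelta with a finer unit promotes the result to that unit.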
Paul Panzer
  • Thank you. Could you please explain those last two lines as I am not familiar with anything following `dates.astype(int)`? – zorrolo Mar 31 '19 at 13:48
  • @zorrolo `np.c_[]` can be used to create column vectors; here, due to broadcasting, the floor division `//` produces a full table of every pair that can be formed between dates and `10000, 100, 1`. Thus we get three copies of dates: one with the last 4 digits removed, one with the last 2 digits removed, and one unchanged. `%` is modulo; here it strips digits from the left, keeping only the last 4 digits in the first row (which is a no-op at this point) and only the last 2 digits in the other two rows. As a result, the variables `y, m, d` hold the year, month and day separately. – Paul Panzer Mar 31 '19 at 15:18
  • ... Next we convert the year first to unicode, then to datetime64, and add the month and day, both converted to timedelta64. – Paul Panzer Mar 31 '19 at 15:18
  • Great! Is there also a modification of this procedure to find every date in, for example, March, ignoring years and days? I was looking for a way to filter that array of dates by month but couldn't find any built-in np.datetime64 function to do so. – zorrolo Mar 31 '19 at 18:43
  • @zorrolo This seems to work: `a[(a.astype('M8[M]') - a.astype('M8[Y]')).view(int) == 2]` – Paul Panzer Apr 01 '19 at 02:24
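
The month filter from the last comment can be sketched as a small standalone example. The array `a` below is made-up sample data, and I use `.astype(np.int64)` where the comment uses `.view(int)`; that is my substitution, chosen because it does not depend on the platform's default integer size.

```python
import numpy as np

# Hypothetical sample data in datetime64[D].
a = np.array(['1970-03-15', '1970-04-01', '1999-03-31', '2019-12-24'],
             dtype='datetime64[D]')

# Truncating to month and to year resolution and subtracting gives the
# number of whole months since January of the same year (0 = January).
month_index = (a.astype('M8[M]') - a.astype('M8[Y]')).astype(np.int64)

# Keep every date that falls in March, whatever the year or day.
march = a[month_index == 2]
print(march)
```

Adding 1 to `month_index` gives the usual calendar month number, so the same mask works for any month.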

You can go via the python datetime module.

from datetime import datetime
import numpy as np

datestrings = np.array(["18930201", "19840404"])
dtarray = np.array([datetime.strptime(d, "%Y%m%d") for d in datestrings], dtype="datetime64[D]")
print(dtarray)

# out: ['1893-02-01' '1984-04-04'] datetime64[D]

Since the real question seems to be how to get the given strings into the matplotlib datetime format, you can also convert them directly with `matplotlib.dates.datestr2num`:

from datetime import datetime
import numpy as np
from matplotlib import dates as mdates

datestrings = np.array(["18930201", "19840404"])
mpldates = mdates.datestr2num(datestrings)
print(mpldates)

# out: [691071. 724370.]
ImportanceOfBeingErnest
  • I have to explain that I am working with many dates, around 40k, because of daily measurements over approx. 125 years. As far as I understand Python, it is faster to work with numpy arrays in the case I described. I had concerns about storing the dates as anything other than `datetime64`. A few days ago I tried something with `datestr2num()`, but the related plot didn't display the dates on the x-axis in a convenient format. So I switched back to what worked for me, as I am running out of time for this project. – zorrolo Apr 02 '19 at 07:24
  • Indeed, if you use `mpldates` from above, you would need to set the locator and formatter on the axis yourself. Concerning speed, it is a bit of a paradox that you want to save a few milliseconds by not using pandas while trying to plot 40k data points on screen. Graphical output is always the bottleneck. Also consider that the time spent trying to avoid pandas could have gone into *learning* pandas. E.g. you could use pandas to subsample the 40k data points into a smaller dataset, which is much faster to plot. – ImportanceOfBeingErnest Apr 02 '19 at 11:32
  • Oh the plot is not the main function of the program! Many calculations are done in advance where numpy and some pieces of scipy are absolutely satisfying. Therefore it would be nonsense to rewrite the whole script to make it work with pandas. But thank you for the additional information. – zorrolo Apr 02 '19 at 12:18