0

How can I extract only the date (not the time) using Pandas' to_datetime?

Let's say I have this string:

>>> date_string = "1975-02-23 02:58:41+00:00"

If I wanted just the year then this works well:

>>> import pandas as pd
>>> pd.to_datetime(date_string, format="%Y", exact=False)
Timestamp('1975-01-01 00:00:00')

But it seems this technique breaks down if I want say, the year and month, or the year, month, and day:

>>> pd.to_datetime(date_string, format="%Y-%m", exact=False)
Timestamp('1975-02-23 02:58:41+0000', tz='UTC')

>>> pd.to_datetime(date_string, format="%Y-%m-%d", exact=False)
Timestamp('1975-02-23 02:58:41+0000', tz='UTC')

Why is it that Pandas is extracting more than I wanted? How can I limit parsing to only what I specify?

I'm specifically interested in the parsing process. For example, let's say the string contained errors in the "hours" portion, and I didn't care about that. I would like to parse only the year-month-day portion.

>>> date_string_with_error = "1975-02-23 25:58:41+00:00"
>>> pd.to_datetime(date_string_with_error, format="%Y", exact=False)
Timestamp('1975-01-01 00:00:00')
>>> pd.to_datetime(date_string_with_error, format="%Y-%m", exact=False)
Traceback (most recent call last):
  File "/Users/matthew/.local/share/virtualenvs/PyOgg-ewux8jXO/lib/python3.8/site-packages/pandas/core/arrays/datetimes.py", line 2054, in objects_to_datetime64ns
    values, tz_parsed = conversion.datetime_to_datetime64(data)
  File "pandas/_libs/tslibs/conversion.pyx", line 350, in pandas._libs.tslibs.conversion.datetime_to_datetime64
TypeError: Unrecognized value type: <class 'str'>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/matthew/.local/share/virtualenvs/PyOgg-ewux8jXO/lib/python3.8/site-packages/pandas/core/tools/datetimes.py", line 830, in to_datetime
    result = convert_listlike(np.array([arg]), format)[0]
  File "/Users/matthew/.local/share/virtualenvs/PyOgg-ewux8jXO/lib/python3.8/site-packages/pandas/core/tools/datetimes.py", line 459, in _convert_listlike_datetimes
    result, tz_parsed = objects_to_datetime64ns(
  File "/Users/matthew/.local/share/virtualenvs/PyOgg-ewux8jXO/lib/python3.8/site-packages/pandas/core/arrays/datetimes.py", line 2059, in objects_to_datetime64ns
    raise e
  File "/Users/matthew/.local/share/virtualenvs/PyOgg-ewux8jXO/lib/python3.8/site-packages/pandas/core/arrays/datetimes.py", line 2044, in objects_to_datetime64ns
    result, tz_parsed = tslib.array_to_datetime(
  File "pandas/_libs/tslib.pyx", line 352, in pandas._libs.tslib.array_to_datetime
  File "pandas/_libs/tslib.pyx", line 496, in pandas._libs.tslib.array_to_datetime
ValueError: time data 1975-02-23 25:58:41+00:00 doesn't match format specified
Matthew Walker
  • 2,527
  • 3
  • 24
  • 30
  • See the duplicate link. Note there;s a difference between a string representation of the datetime and an actual datetime type. So first you convert to datetime. then use `dt.date` to only get the date. – Erfan Nov 22 '20 at 23:32
  • @Erfan, thanks for your quick review and link to the other question. In the linked question the solutions do not discuss parsing but rather the interpretation of the results (using, for example `dt.year`). The question I am asking is about parsing and how to control it. Although the end results are very similar, the process is quite different. Can you re-open this question please? – Matthew Walker Nov 23 '20 at 01:08
  • Your answer is `pd.to_datetime(date_string).dt.year`, I will reopen, but this is definitely a duplicate, I just dont have time to find it. – Erfan Nov 23 '20 at 01:13
  • I think you misunderstand parsing, strings and datetime. I would suggest to check the pandas documentation. – Erfan Nov 23 '20 at 01:14
  • @Erfan, thanks for re-opening the question. I've edited it to show (hopefully) more clearly that I'm interested in the parsing process. – Matthew Walker Nov 23 '20 at 01:26
  • Again you are misunderstanding parsing, try: `pd.to_datetime(date_string_with_error).strftime('%Y')` – Erfan Nov 23 '20 at 01:27
  • `pd.to_datetime(date_string_with_error)` produces the error `dateutil.parser._parser.ParserError: hour must be in 0..23: 1975-02-23 25:58:41+00:00`. As a consequence the `.strftime('%Y')` isn't called. I'm not sure I understand your suggestion. – Matthew Walker Nov 23 '20 at 01:31
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/224945/discussion-between-erfan-and-matthew-walker). – Erfan Nov 23 '20 at 01:32

1 Answers1

1
pd.to_datetime(date_string).year

enter image description here

wwnde
  • 26,119
  • 6
  • 18
  • 32
  • Did you test this? – Erfan Nov 23 '20 at 00:20
  • @Erfan Yap, I tested on latest versions. Happy for any criticism, comments if you have any – wwnde Nov 23 '20 at 00:21
  • This works for a scalar, but I can imagine that OP is asking for columns, since this is a pandas questions. For columns you will need the `dt` accessor. – Erfan Nov 23 '20 at 00:22
  • I am a ware of the `dt.accessor` in dataframes. He however used the word `parse` which I imagined he wanted to throw into a loop. – wwnde Nov 23 '20 at 00:24
  • @Erfan I am satisfied if he didnt mean a scalar, he can refere\ to the duplicates referenced in the close comment – wwnde Nov 23 '20 at 00:25
  • 1
    Yes sure, makes sense. – Erfan Nov 23 '20 at 00:26
  • Thanks very much wwnde and @Erfan. Indeed, your solution is a way to extract the output once the entire string has been parsed. However, I'm interested in the control of the parsing process. Why is it that the last and second-to-last approaches that I proposed do not work? How should I go about controlling the parsing process? Why does it work for years, but not for years-months or years-months-days? Is it not able to be controlled perhaps? – Matthew Walker Nov 23 '20 at 01:12
  • As for the discussion regarding scalars versus series, I am indeed interested in using the approach for a series. I felt that it was just simpler to discuss the issue in its most basic form. But if I'm wrong to assume that a scalar response can be re-interpreted for a series then I'm all ears. Thanks again. – Matthew Walker Nov 23 '20 at 01:17