-1

I have a date column in my DataFrame, df_dob, that looks like this:

id DOB
23312 31-12-9999
1482 31-12-9999
807 #VALUE!
2201 06-12-1925
653 01/01/1855
108 01/01/1855
768 1967-02-20

What I want to print is a list of unique years, like `['9999', '1925', '1855', '1967']`.

Basically, I want to use this list to check whether any unwanted year is present. I have tried the code pasted below, but I get `ValueError: time data 01/01/1855 doesn't match format specified` and could not resolve it.

df_dob['DOB'] = df_dob['DOB'].replace('01/01/1855 00:00:00', '1855-01-01')
df_dob['DOB'] = pd.to_datetime(df_dob.DOB, format='%Y-%m-%d')
df_dob['DOB'] = df_dob['DOB'].dt.strftime('%Y-%m-%d')
print(np.unique(df_dob['DOB']))
# print(list(df_dob['DOB'].year.unique()))

P.S - when I print df_dob['DOB'], I get values like - 1967-02-20 00:00:00

maven

3 Answers

0

Use pandas' unique() for this, applied to the year only.

So try:

print(df_dob['DOB'].dt.year.unique())

Also, you don't need to stringify your dates, and you don't need to replace anything; pandas is smart enough to do it for you. So your overall code becomes:

df_dob['DOB'] = pd.to_datetime(df_dob.DOB)    # No need to pass format unless there is some specific anomaly
print(df_dob['DOB'].dt.year.unique())

Edit:

Another method: since you have an out-of-bounds problem, you can skip the conversion to datetime and instead find the first four-digit number in each value of the column using a regex. So,

df_dob['DOB'].str.extract(r'(\d{4})')[0].unique()

The [0] is there because str.extract() returns a DataFrame, and unique() is a method of pd.Series, not of a DataFrame. So we take the first (and only) column of that DataFrame.
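A minimal, self-contained sketch of the regex approach, assuming the sample rows from the question are stored as strings. The `expand=False` argument and the `dropna()` call are additions for illustration: the first makes `str.extract` return a Series directly, and the second drops rows such as `#VALUE!` that contain no four-digit run.

```python
import pandas as pd

# Sample rows copied from the question; DOB is kept as plain strings.
df_dob = pd.DataFrame({
    "id": [23312, 1482, 807, 2201, 653, 108, 768],
    "DOB": ["31-12-9999", "31-12-9999", "#VALUE!", "06-12-1925",
            "01/01/1855", "01/01/1855", "1967-02-20"],
})

# With a single capture group and expand=False, str.extract returns a Series,
# so no [0] column lookup is needed.
years = df_dob["DOB"].astype(str).str.extract(r"(\d{4})", expand=False)

# Rows without four consecutive digits come back as missing; drop them.
unique_years = years.dropna().unique().tolist()
print(unique_years)
```

Note that `\d{4}` picks up the first run of four digits wherever it appears, so this only works as a year check while the year is the only four-digit group in each value.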

Nitish
  • Hi @Nitish, I have tried doing but I get `OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 9999-12-31 00:00:00` – maven Aug 19 '22 at 09:08
  • I see you have some time with year 9999. Whatsup with that? – Nitish Aug 19 '22 at 09:17
  • Yes, that is included in the data, it might be 1999, but mistakenly typed 9999, that is the reason I need to print the list of years to check. – maven Aug 19 '22 at 09:26
  • 1
    Pandas limits the range of dates a nanosecond-resolution datetime column can hold: roughly 584 years, from 1677 to 2262. More details in this post https://stackoverflow.com/questions/32888124/pandas-out-of-bounds-nanosecond-timestamp-after-offset-rollforward-plus-adding-a – Nitish Aug 19 '22 at 09:29
  • Hi @Nitish, thanks for guiding me, so is there any way to filter out my data based on valid dates, so I just wanted to have those rows where there is a valid date – maven Aug 19 '22 at 09:58
0

Can you try this?

df_dob["DOB"] = pd.to_datetime(df_dob["DOB"])

df_dob['YOB'] = df_dob['DOB'].dt.strftime('%Y')
Nova
  • Hi @Nova, I tried but I get - `OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 9999-12-31 00:00:00` – maven Aug 19 '22 at 09:28
  • 1
    corrected with `df_dob["DOB"] = pd.to_datetime(df_dob["DOB"], errors = 'coerce')` – maven Aug 19 '22 at 09:33
0

The first thing you need to know is whether the resulting values (which you said look like 1967-02-20 00:00:00) are datetimes or not. Checking is as simple as df_dob.info()

If the result shows something like datetime64[ns] for the DOB column, you're good. If not, you'll need to cast it to a datetime. You have a couple of different formats, so that might be part of your problem. Because there are several ways of doing this and it's a separate question, I'm not addressing it here.
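For reference, here is one possible way to handle the mixed formats mentioned above, sketched with only the standard library so that year 9999 does not hit the pandas nanosecond range. The helper name `parse_dob` and the list of candidate formats are assumptions made for this illustration, and reading `01/01/1855` as day-first is a guess, since the sample data does not disambiguate it.

```python
from datetime import datetime

# Formats seen in the question's sample data (assumed; extend as needed).
CANDIDATE_FORMATS = ("%d-%m-%Y", "%d/%m/%Y", "%Y-%m-%d")


def parse_dob(text):
    """Return a datetime for the first format that matches, else None."""
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue
    return None


raw = ["31-12-9999", "31-12-9999", "#VALUE!", "06-12-1925",
       "01/01/1855", "01/01/1855", "1967-02-20"]

parsed = [parse_dob(value) for value in raw]

# Values that matched none of the formats, kept for inspection.
unparseable = [value for value, dt in zip(raw, parsed) if dt is None]

# Unique years among the values that did parse.
years = sorted({dt.year for dt in parsed if dt is not None})

print(years)
print(unparseable)
```

The same function could be applied to the column with `df_dob['DOB'].map(parse_dob)`, leaving an object column of Python datetimes; rows where the result is missing are the ones without a valid date.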

We're going to leverage the speed of sets, plus a bit of pandas, and then convert the result back to a list, since that's the final form you wanted.

years = list({i for i in df_dob['DOB'].dt.year})

And just a side note: you can't use [] instead of list(), as you'll end up with a list containing a single element that is a set.

That's a list as you indicated. If you want it as a column, you won't get unique values

Nitish's answer will also work, but it gives you a NumPy array, something like: array([9999, 1925, 1855, 1967])
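To illustrate the side note about list() versus [], here is a small sketch using a plain Python list as a stand-in for the year values of the DOB column (the variable names are made up for this example). Iterating over a pandas Series of years behaves the same way.

```python
# Stand-in for the year values of the DOB column (hypothetical sample).
years_column = [9999, 9999, 1925, 1855, 1855, 1967]

# list(...) unpacks the set into a flat list of unique years.
as_list = list({y for y in years_column})

# [...] wraps the whole set, giving a one-element list that holds a set.
wrapped = [{y for y in years_column}]

# Sets are unordered, so sort before displaying or comparing.
print(sorted(as_list))
print(len(wrapped), type(wrapped[0]).__name__)
```

Because a set does not keep insertion order, the years may come out in any order unless they are sorted.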

hrokr