0

I have dates in this format 12/29/2011 as a string I guess and I only need Year so I write this function to extract year only but I got
"ValueError: cannot convert float NaN to integer" Seems like I have Nan's somewhere and only solution I can think of is to drop the rows with Nan's but I cant do that coz I need the data from other columns.

 def get_year(date):
    year = ''
    try:
        year = date[-4:]
    except TypeError:
        year = str(date)[0:4]
    return (year).astype(int)

The get_year function works when I use this code

for i in df.index:
    if (not pd.isna(df['yearOpened'][i]) and get_year(df['yearOpened'][i]) > 1955):
            print('something')

I am using .loc and wants to know how to skip Nan's using .loc

`df.loc[get_year(df['yearOpened'])]`
  • There is no date. You can't make a correct date just appear. What value is passed to `get_year` when throwing this error? – ifly6 Feb 27 '20 at 22:22
  • https://stackoverflow.com/questions/47333227/pandas-valueerror-cannot-convert-float-nan-to-integer?rq=1 – emsimpson92 Feb 27 '20 at 22:27
  • @ifly6 values from column 'yearOpened' that are in these format as a string 12/12/1999 –  Feb 27 '20 at 22:30
  • `ValueError Traceback (most recent call last) in ----> 1 df.loc[get_year(df['yearOpened'])] in get_year(date) 7 year = str(date)[0:4] 8 # print(year[0:4]) ----> 9 return (year).astype(int)` –  Feb 27 '20 at 22:33
  • Just parse them with `pd.to_datetime(your_string_series)` and then access date with `date_series.dt.year` – ifly6 Feb 27 '20 at 22:36
  • @ifly6 will that handle the Nan error that is my main concern if the datasets doesnot have Nan or empty value my function works –  Feb 27 '20 at 22:41
  • `for i in df.index: if (not pd.isna(df['yearOpened'][i]) and get_year(df['yearOpened'][i]) > 1955): print('something')` This works because not pd.isna handles the Nan values –  Feb 27 '20 at 22:41
  • `# df['year'] = df['yearBuilt'].str[0:2] # df['year'] = df['year'].astype('Int32') # df.head()` Also Tried this but Nan values or missing values is the main concern –  Feb 27 '20 at 22:46
  • What is the value passed to `get_year` when this error is thrown? – ifly6 Feb 27 '20 at 22:49
  • **Please share the entire error message as well as a [mcve].** – AMC Feb 27 '20 at 23:56
  • The loop code you shared is quite unidiomatic, be careful. I would strongly recommend reading the Pandas docs. – AMC Feb 27 '20 at 23:57

1 Answers1

0

You can use Python's inbuilt datetime library to grab the year from your string with ease.

from datetime import datetime

date = '12/29/2011'
dt = datetime.strptime(date, '%m/%d/%Y') #create datetime object
dt.year

Output: 2011

OR

You could use the pandas.to_datetime function which will handle the nan values for you.

import pandas as pd
import numpy as np

dates = ['12/29/2011', '12/30/2012', np.nan]
dt = pd.to_datetime(dates)
dt.year

Output: Float64Index([2011.0, 2012.0, nan], dtype='float64')

Edit (in response to comments)

To get a DataFrame of all the bad rows you can simply index out all the rows that return NaT from the pd.to_datetime(df['dates'], errors='coerce') operation.

data = {'dates': ['12/29/2011', '12/30/2012', np.nan, '1/1/9999'],
        'values': [1,2,3,4]}

df = pd.DataFrame(data)
dt = pd.to_datetime(df['dates'], errors='coerce')

bad_rows = df[dt.isna()] #indexing out all rows which contain nan values
bad_rows.to_csv('bad_data.csv')
Ali
  • 328
  • 2
  • 7
  • Thank You this worked but got another error apparently data I am working with is not super clean `OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 9999-12-31 00:00:00` –  Feb 27 '20 at 23:02
  • got Out of bound error with this : `def get_year(date): pd.to_datetime(date) return dt.year df.loc[get_year(df['yearBuilt'])]` Can anyone please suggest me how to handle error on this one OR `def get_year(date): year = '' try: year = date[-4:] except TypeError: year = str(date)[0:4] return (year).astype(int)` Ways to handle Nan on this one THANKS A LOT in advance –  Feb 27 '20 at 23:07
  • you could set errors to 'coerce': `dt = pd.to_datetime(dates, errors='coerce')`. This would replace the out of bound datetime with NaT. Not sure if this is what you want though. – Ali Feb 27 '20 at 23:28
  • I am trying to flag if the data has anything funky like that. So if the data has something like that or it's empty I need to save the whole row on different dataframe and export it to csv. Does that make sense ? –  Feb 28 '20 at 00:21
  • I have edited the answer with the solution to your problem as I understand it. Hope that helps. – Ali Feb 28 '20 at 22:26
  • Hi, is that what you were looking for? – Ali Mar 01 '20 at 15:15
  • @ ALS777 not exactly but that solved my problem. Thank you very much –  Mar 03 '20 at 23:49