Unexpected parsing of Timestamps reading pandas dataframe from .csv file

Question

I have a pandas dataframe read from a csv file which includes a column of dates

In[1]: df = pd.read_csv(fileStats, index_col=0, parse_dates= ['start_date'])

In[2]: df.head()
Out[3]:
                 start_date  
    customer_ID                                                               
    10006414       2012-10-02 08:30:00   
    10006486       2013-12-02 09:00:00   
    10006492       2013-12-02 09:00:00   
    10006572       2012-03-31 08:30:00   
    10006630       2012-01-06 10:30:00

I want to find the earliest date:

In[4]: df['start_date'].min()
Out[5]:
Timestamp('2011-01-12 08:00:00')

which looks fine except that when I open the .csv file in excel, and order the rows by start_date, the earliest date is 2011-10-06 13:00:00. The same problem happens with .max() This post refers to a similar problem, but related to np.datetime64 which I am not using here. Is there an issue with .max() and .min() or is it to do with parsing the dates in .read_csv ?

EDIT: Further investigation on suggestions from @DYZ shows that the Timestamps have been read or parsed incorrectly. 6/10/2011 13:00 in the .csv file has become 2014-03-03 10:00:00 in the dataframe. How to correct this?

`In[6]: earliest = df['start_date'].min()` `In[7]: earliest in df` `Out[8]: False` — doctorer, Apr 10 '17 at 02:19
`earliest in df` does not check if a row is in a dataframe. Check `earliest in df['start_date'].values`. — DYZ, Apr 10 '17 at 02:23
So, looks like that line was not read, or was not read correctly, or was not parsed correctly. I would check id the number of rows in `df` and in the CSV file is the same. — DYZ, Apr 10 '17 at 02:26
Good suggestion. Yes they have the same number of rows. Looking at the specific `customer_ID` shows that the timestamp been read or parsed incorrectly. `6/10/2011 13:00` has become `2014-03-03 10:00:00` — doctorer, Apr 10 '17 at 02:37
My guess is that it actually reads `6/10/2011 13:00` as `2011-06-10 13:00:00` and `01/12/2011 08:00:00` as `2011-01-12 08:00:00`. Try adding `dayfirst=True` as an argument to `pd.read_csv`. — tmrlvi, Apr 14 '17 at 10:28

Unexpected parsing of Timestamps reading pandas dataframe from .csv file

0 Answers0