0

I have a pandas dataframe read from a csv file which includes a column of dates

In[1]: df = pd.read_csv(fileStats, index_col=0, parse_dates= ['start_date'])

In[2]: df.head()
Out[3]:
                 start_date  
    customer_ID                                                               
    10006414       2012-10-02 08:30:00   
    10006486       2013-12-02 09:00:00   
    10006492       2013-12-02 09:00:00   
    10006572       2012-03-31 08:30:00   
    10006630       2012-01-06 10:30:00  

I want to find the earliest date:

In[4]: df['start_date'].min()
Out[5]:
Timestamp('2011-01-12 08:00:00')

which looks fine except that when I open the .csv file in excel, and order the rows by start_date, the earliest date is 2011-10-06 13:00:00. The same problem happens with .max() This post refers to a similar problem, but related to np.datetime64 which I am not using here. Is there an issue with .max() and .min() or is it to do with parsing the dates in .read_csv ?

EDIT: Further investigation on suggestions from @DYZ shows that the Timestamps have been read or parsed incorrectly. 6/10/2011 13:00 in the .csv file has become 2014-03-03 10:00:00 in the dataframe. How to correct this?

Community
  • 1
  • 1
doctorer
  • 1,672
  • 5
  • 27
  • 50
  • Start by checking the earliest date is _in_ `df`. – DYZ Apr 10 '17 at 02:14
  • `In[6]: earliest = df['start_date'].min()` `In[7]: earliest in df` `Out[8]: False` – doctorer Apr 10 '17 at 02:19
  • `earliest in df` does not check if a row is in a dataframe. Check `earliest in df['start_date'].values`. – DYZ Apr 10 '17 at 02:23
  • `earliest in df['start_date'].values` returns `False` – doctorer Apr 10 '17 at 02:25
  • So, looks like that line was not read, or was not read correctly, or was not parsed correctly. I would check id the number of rows in `df` and in the CSV file is the same. – DYZ Apr 10 '17 at 02:26
  • Good suggestion. Yes they have the same number of rows. Looking at the specific `customer_ID` shows that the timestamp been read or parsed incorrectly. `6/10/2011 13:00` has become `2014-03-03 10:00:00` – doctorer Apr 10 '17 at 02:37
  • My guess is that it actually reads `6/10/2011 13:00` as `2011-06-10 13:00:00` and `01/12/2011 08:00:00` as `2011-01-12 08:00:00`. Try adding `dayfirst=True` as an argument to `pd.read_csv`. – tmrlvi Apr 14 '17 at 10:28

0 Answers0