0

Hi I am struggling with finding the max and min of date column in my dataset.

Below is my dataset:

customer_final['tran_date']

0       2014-02-28
1       2014-02-27
2       2014-02-24
3       2014-02-24
4       2014-02-23
           ...    
23048   2011-01-25
23049   2011-01-25
23050   2011-01-25
23051   2011-01-25
23052   2011-01-25

As clearly we can see in the dataset we have data between 2011-01-25 to 2014-02-28.

But executing the below-mentioned code is giving me the wrong output.

print(customer_final['tran_date'].max())
print(customer_final['tran_date'].min())

2014-12-02 00:00:00
2011-01-02 00:00:00

Any help would be highly appreciated.

Edit: Posting raw data.

transaction_id  cust_id tran_date   prod_subcat_code    prod_cat_code   Qty Rate    Tax total_amt   Store_type
0   80712190438 270351  28-02-2014  1   1   -5  -772    405.300 -4265.300   e-Shop
1   29258453508 270384  27-02-2014  5   3   -5  -1497   785.925 -8270.925   e-Shop
2   51750724947 273420  24-02-2014  6   5   -2  -791    166.110 -1748.110   TeleShop
3   93274880719 271509  24-02-2014  11  6   -3  -1363   429.345 -4518.345   e-Shop
4   51750724947 273420  23-02-2014  6   5   -2  -791    166.110 -1748.110   TeleShop
... ... ... ... ... ... ... ... ... ... ...
23048   94340757522 274550  25-01-2011  12  5   1   1264    132.720 1396.720    e-Shop
23049   89780862956 270022  25-01-2011  4   1   1   677 71.085  748.085 e-Shop
23050   85115299378 271020  25-01-2011  2   6   4   1052    441.840 4649.840    MBR
23051   72870271171 270911  25-01-2011  11  5   3   1142    359.730 3785.730    TeleShop
23052   77960931771 271961  25-01-2011  11  5   1   447 46.935  493.935 TeleShop

Edit 2: Datatypes of all the columns in the DF.

transaction_id               int64
cust_id                      int64
tran_date           datetime64[ns]
prod_subcat_code             int64
prod_cat_code                int64
Qty                          int64
Rate                         int64
Tax                        float64
total_amt                  float64
Store_type                  object
Unnamed: 10                 object
dtype: object
gm-123
  • 248
  • 3
  • 16
  • Have you tried this? https://stackoverflow.com/questions/23178129/getting-min-and-max-dates-from-a-pandas-dataframe – thebernardlim Mar 08 '20 at 14:38
  • Is your column of type datetime? – coco18 Mar 08 '20 at 14:39
  • Yes sir! But no luck. – gm-123 Mar 08 '20 at 14:41
  • Try to get `customer_final['tran_date'].max(). dt.day` at the max position and see if your datetime type is correct. – coco18 Mar 08 '20 at 14:43
  • @coc018 Yes, it's dtype is `datetime64[ns]` – gm-123 Mar 08 '20 at 14:44
  • Perhaps the day is on the second place. That's why ist getting the right results. Try this as well: `pd.to_datetime(customer_final['tran_date')).max()` – coco18 Mar 08 '20 at 14:47
  • @coco18 it's also giving the same result : `Timestamp('2014-12-02 00:00:00')`. – gm-123 Mar 08 '20 at 14:51
  • Did you try to get the day of the max? – coco18 Mar 08 '20 at 14:52
  • @coco18 Yes. I tried to get the day for every date using `dt.day`. It's picking the correct day. But while attempting the do so on the max date using `customer_final['tran_date'].max(). dt.day` it's giving error: `'Timestamp' object has no attribute 'dt'`. – gm-123 Mar 08 '20 at 14:59
  • Yes, but you can get it with `loc[] ` only to see which day and month returns. – coco18 Mar 08 '20 at 15:02
  • before converting to DateTime specify the format `pd.to_datetime(...,format='%Y-%m-%d')` then call your min and max. – Umar.H Mar 08 '20 at 15:06
  • @Datanovice It's giving some weird error: `time data '12/2/2014' does not match format '%d-%m-%Y'` – gm-123 Mar 08 '20 at 15:12
  • The change the `-` with `/` – coco18 Mar 08 '20 at 15:13
  • can you post your raw data? – Umar.H Mar 08 '20 at 15:14
  • Was your data sorted by *tran_date* before running `customer_final['tran_date']` that you post above? If not, do so with `sort_values` and report back. Also, please post `customer_final.dtypes`. – Parfait Mar 08 '20 at 15:50
  • Yes @Parfait, upon sorting the column I can see the values are as returned by Python. But, upon doing so manually in Excel I can see my data is for the date range 2011-01-25 to 2014-02-28. – gm-123 Mar 08 '20 at 16:04
  • So, is your question resolved in Python pandas? Specifically, do dates `2014-12-02` and `2011-01-02` show after sorting: `customer_final = customer_final.sort_values(by='tran_date', ascending = False)` – Parfait Mar 08 '20 at 16:11
  • No @Parfait. Inspecting the data set manually I can see, I have data available for date range `2011-01-25` to `2014-02-28`. But upon applying sort values on the respective column I can see the data is between `2011-01-02` and `2014-12-02`. – gm-123 Mar 08 '20 at 16:17

2 Answers2

0

Apparently you have your dates (in an input file) formatted with various ways.

One of your comments contains Timestamp('2014-12-02 00:00:00'), so I see that you have %Y-%m-%d formatting (probably in most cases), but in another place you have written time data '12/2/2014', so at least in some rows you have %d/%m/%Y formatting.

Bring your input to order. You can't have dates formatted in 2 different formattings.

Edit

I performed the following experiment:

As the source data I used a part of your raw data (first 2 and last 2 rows), with additional row (No 3) with different date formatting, stored as a string variable:

  transaction_id cust_id tran_date prod_subcat_code prod_cat_code Qty Rate Tax total_amt Store_type
3       93274880719 271509  24-02-2014  11  6   -3  -1363   429.345 -4518.345   e-Shop
4       51750724947 273420  23-02-2014  6   5   -2  -791    166.110 -1748.110   TeleShop
40      51750724947 273420  12/2/2014   6   5   -2  -791    166.110 -1748.110   TeleShop
23048   94340757522 274550  25-01-2011  12  5   1   1264    132.720 1396.720    e-Shop
23049   89780862956 270022  25-01-2011  4   1   1   677     71.085  748.085     e-Shop'''

Note that the initial row has some spaces at the beginning, to provide the empty column name for the index column.

Then I defined the following date parsing function (import re required), to be used soon:

def dPars(txt):
    if re.match(r'\d{2}-\d{1,2}-\d{4}', txt):
        return pd.to_datetime(txt, format='%d-%m-%Y')
    if re.match(r'\d{2}/\d{1,2}/\d{4}', txt):
        return pd.to_datetime(txt, format='%d/%m/%Y')
    return txt

I read the above content, with the above date converter function:

customer_final = pd.read_csv(io.StringIO(txt), delim_whitespace=True,
    index_col=0, parse_dates=['tran_date'], date_parser=dPars)

I printed tran_date column - print(customer_final.tran_date) - getting

3       2014-02-24
4       2014-02-23
40      2014-02-12
23048   2011-01-25
23049   2011-01-25
Name: tran_date, dtype: datetime64[ns]

so all dates have been parsed as they should be.

I printed min / max dates - print(customer_final['tran_date'].min(), customer_final['tran_date'].max()) - getting correct result:

2011-01-25 00:00:00 2014-02-24 00:00:00

Maybe you should base your code on my experiment (in your code replace io.StringIO(txt) with your input file name).

Note also that if you have some input row formatted as 12/2/2014 then 12 is the month number and 2 is the day number (US date format), whereas other rows have the day number first.

Valdi_Bo
  • 30,023
  • 4
  • 23
  • 41
  • I have tried different date formats to check there exist any problem within the data set. – gm-123 Mar 08 '20 at 15:46
  • If original field was already a datetime as [OP comments](https://stackoverflow.com/questions/60588564/finding-min-and-max-of-date-column-in-pandas#comment107190321_60588564), formatting cannot be the issue as mixed types are not allowed with this type. It is highly probable OP simply did not sort the data. – Parfait Mar 08 '20 at 15:55
0

Fundamentally, you have two issues: 1) viewing abbreviated data and 2) viewing unsorted data.

You claim: As clearly we can see in the dataset we have data between 2011-01-25 to 2014-02-28. However, Pandas is abbreviating your unsorted data, omitting many rows of your 23k row data frame with ellipsis: .... Hence, this manual inspection of dates you cite are only from the head and tail of the unordered data that will not match min and max values.

customer_final['tran_date']

# 0       2014-02-28       # <---- HEAD OF UNSORTED DATA
# 1       2014-02-27
# 2       2014-02-24
# 3       2014-02-24
# 4       2014-02-23
#            ...           # <---- OMITTED VALUES OF UNSORTED DATA 
# 23048   2011-01-25
# 23049   2011-01-25
# 23050   2011-01-25
# 23051   2011-01-25
# 23052   2011-01-25       # <---- TAIL OF UNSORTED DATA

You can remove omission of rows with pd.set_option('display.max_rows', None) but you may overwhelm yourself showing 23k+ values that are not sorted.

Therefore, those min and max are not wrong. To double check, actually sort your data and then print the column out or its head and tail. Doing so, the aggregate figures should match accordingly.

# SORT DATA FRAME IN DESCENDING ORDER BY tran_date
customer_final = customer_final.sort_values(by='tran_date', ascending = False)

# VIEW ALL DATA (ABBREVIATED UNLESS YOU CHANGE SETTING)
customer_final['tran_date']

# VIEW FIRST VALUES (DEFAULT TO 5)
customer_final['tran_date'].head(5)

# VIEW LAST VALUES (DEFAULT TO 5)
customer_final['tran_date'].tail(5)
Parfait
  • 104,375
  • 17
  • 94
  • 125