-1

The dates in my dataset are inconsistent. Is there any way to make them in a particular format like YY/DD/MM? All these dates are of the month of January and are continuous but date and month flipped from the 7th row.

0     2016-04-01
1     2016-05-01
2     2016-06-01
3     2016-07-01
4     2016-08-01
5     2016-11-01
6     2016-12-01
7     2016-01-13
8     2016-01-14
9     2016-01-15
10    2016-01-18
11    2016-01-19

CSV Data looks something like this. Date in this file is an object not in actual date format. enter image description here

Expected output should look something like this:

0     2016-04-01
1     2016-05-01
2     2016-06-01
3     2016-07-01
4     2016-08-01
5     2016-11-01
6     2016-12-01
7     2016-13-01
8     2016-14-01
9     2016-15-01
10    2016-18-01
11    2016-19-01

All the dates are in YY/DD/MM format

  • not without converting to a string but then you lose the datetime64 dtype and this loses any useful arithmetic operations and comparison operations – EdChum Jan 25 '19 at 16:57
  • 1
    Yes, several ways. What did you try from your research? What is the source of the data? – roganjosh Jan 25 '19 at 16:57
  • Possible duplicate of [How to print a date in a regular format?](https://stackoverflow.com/questions/311627/how-to-print-a-date-in-a-regular-format) – Hippolippo Jan 25 '19 at 16:58
  • I'm not sure what's inconsistent about them. They seem to be in a consistent YYYY-MM-DD format. What have you tried? – Jordan Singer Jan 25 '19 at 16:58
  • 1
    @JordanSinger the `mm-dd`/`dd-mm` ordering is being inferred and the rule flips once it becomes clear you can't have a 13th month – roganjosh Jan 25 '19 at 16:59
  • I tried pd.to_datetime function, but it was also of no use. The source of data is a csv file – Prateek Sharma Jan 25 '19 at 17:00
  • 1
    @roganjosh it isn't clear whether there really is a flip. All of these dates as listed make sense under the YYYY-MM-DD format. – Jordan Singer Jan 25 '19 at 17:00
  • 1
    These are not consistent. They are all the dates of the month of January but they get mixed up in between, like from 5th row. – Prateek Sharma Jan 25 '19 at 17:01
  • 2
    @JordanSinger it is _possible_ that they are all real dates, but this is a common output when the date format is being inferred. – roganjosh Jan 25 '19 at 17:02
  • 1
    This is all dependent on the source of this data. Is the source reliable? We cannot answer that without more information. @PrateekSharma I would trace this back to the data's source and try to figure out how this csv is being generated in the first place. – Jordan Singer Jan 25 '19 at 17:02
  • 1
    You could try the `day_first` parameter of [`read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html). We do really need to see the raw CSV data though to understand how to fix it – roganjosh Jan 25 '19 at 17:04
  • 1
    @Prateek I didn't really understand the problem till I read the comments. Could you [edit] the question to clarify? Adding the expected output would work. – wjandrea Jan 25 '19 at 17:35
  • @wjandrea did that – Prateek Sharma Jan 25 '19 at 17:39
  • 2
    Your edit does _not_ show the raw CSV data. Excel performs all sorts of incantations on data, don't trust it. View in Notepad++ or something to see the actual raw data – roganjosh Jan 25 '19 at 18:10
  • 1
    Do not use YYYY-DD-MM! It looks like [ISO 8601](https://en.wikipedia.org/wiki/ISO_8601) (YYYY-MM-DD) but it's not. Use YY/DD/MM if you have to, but even that's not a standard format. – wjandrea Jan 25 '19 at 18:54

1 Answers1

0

If you already know precisely what rows for which you need to have the date reformatted (and if the .csv isn't too large), you can call pd.to_datetime() and then strftime() to have the necessary "dates" displayed in the format you desire. The downside of this is that as one of the commenters pointed out, the datatype of these will not be datetime64.

If you're okay with this, and if the only thing that matters is the appearance of the dates (and you won't need to perform arithmetic operations on them), here's how it would go using the example you gave:

d = {'date': ['2016-04-01', '2016-05-01', '2016-06-01', '2016-07-01', '2016-08-01',
              '2016-11-01', '2016-12-01', '2016-01-13', '2016-01-14', '2016-01-15',
              '2016-01-18', '2016-01-19']}

df = pd.DataFrame(data=d)

    date
0   2016-04-01
1   2016-05-01
2   2016-06-01
3   2016-07-01
4   2016-08-01
5   2016-11-01
6   2016-12-01
7   2016-01-13
8   2016-01-14
9   2016-01-15
10  2016-01-18
11  2016-01-19

Running this line will convert the final five rows to the appearance you desire:

df['date'].iloc[7:] = pd.to_datetime(df['date'].iloc[7:]).apply(lambda x: x.strftime('%Y-%d-%m'))

The resulting dataframe looks like this:

    date
0   2016-04-01
1   2016-05-01
2   2016-06-01
3   2016-07-01
4   2016-08-01
5   2016-11-01
6   2016-12-01
7   2016-13-01
8   2016-14-01
9   2016-15-01
10  2016-18-01
11  2016-19-01

If we run df.iloc[11].dtype we see that the datatype is dtype('O').

James Dellinger
  • 1,281
  • 8
  • 9