1

I have an Excel files with a column of date-like string mixed with some illegal date-strings, just like: my column

I want to know how to use pandas to read it with exactly the same as original.

I don't want any date conversion.

I've tried many ways including

df = pd.read_excel(path)
df['Tran Date'] = df['Tran Date'].apply(lambda x: x.strftime('%m/%d/%y') if x else "")

I don't want to raise an Exception in the lambda since the data is millions rasing so much exception would cause severe performance problems.

So is there a way to treat the column as string and stop any date conversion?

SkyOne
  • 188
  • 3
  • 15

1 Answers1

1

When reading the Excel you can specify the dtype. So:

df = pd.read_excel('file',dtype={'Tran Date'='object'})
print(df.dtypes)
reg_date        object
dtype: object
print(df)
0      2020/1/18
1      2020/1/18
2      2020/1/18
3      2020/1/18
4      2020/1/18
5      2020/1/18
6      2020/1/18
7      2020/1/18
8      2020/1/18
9      2020/1/18
10    09/19/CN11
11    09/19/CN11
12    09/19/CN11
13    09/19/CN11
14    09/19/CN11
15    09/19/CN11
16    09/19/CN11
17    09/19/CN11

Added after comment

Something like this would do the work with the problem mentioned in comment although we are back to lambda, which you were trying to avoid:

df['Tran Date'] = df['Tran Date'].apply(lambda x: x.rstrip('00:00:00'))
df['Tran Date'] = df['Tran Date'].apply(lambda x: x.replace('-','/'))
df['Tran Date'] = df['Tran Date'].apply(lambda x: (x[-3:-1]+'/'+x[-6:-4]+'/'+x[0:-7]) if (x[2] != '/') else x)

0     18/01/2020
1     18/01/2020
2     18/01/2020
3     18/01/2020
4     18/01/2020
5     18/01/2020
6     18/01/2020
7     18/01/2020
8     18/01/2020
9     18/01/2020
10    09/19/CN11
11    09/19/CN11
12    09/19/CN11
13    09/19/CN11
14    09/19/CN11
15    09/19/CN11
16    09/19/CN11
17    09/19/CN11
teoeme139
  • 412
  • 3
  • 11
  • I'm not sure why it still converts to '2020-01-18 00:00:00'? – SkyOne Feb 21 '20 at 12:20
  • Well, it depends on the format you apply to those cells in Excel. You need to save the file as CSV and then import or save the dates as string in Excel. Otherwise, perform some kind of conversion in Pandas to remove hour, minutes, seconds – teoeme139 Feb 21 '20 at 12:34
  • See edit in answer – teoeme139 Feb 21 '20 at 12:52