0

I read an excel file from a URL and need to do some cleaning before I can save it. In the original excel file, there are some logos and null entries in the first few rows and then the real data starts. The column date (Unnamed 0:) in excel file appears as date but when read into pandas for some reason it is converted to number. Using astype and pd.to_datetime converts the column to date but incorrect date. Any advice?

note: although date column appears as date in excel sheet, its type is general. I can go and manually change the type first in the excel file but I do not want to do that because I want to automate the process. enter image description here

if you want to produce data, here is the code:

from requests import get
import pandas as pd

url = 'http://rigcount.bakerhughes.com/static-files/55ff50da-ac65-410d-924c-fe45b23db298'
# make HTTP request to fetch data
r = get(url)

# check if request is success
r.raise_for_status()

# write out byte content to file
with open('out.xlsb', 'wb') as out_file:
    out_file.write(r.content)

Canada_Oil_Gas = pd.read_excel('out.xlsb', sheet_name='Canada Oil & Gas Split', engine='pyxlsb')
baharak Al
  • 71
  • 1
  • 10
  • Does this answer your question? [Pandas Read\_Excel Datetime Converter](https://stackoverflow.com/questions/42958217/pandas-read-excel-datetime-converter) – Warcupine Oct 19 '21 at 19:17
  • I tried all suggestions in that post. I still get DATE column as the picture above. – baharak Al Oct 19 '21 at 19:24

1 Answers1

1

The problem is caused by the cell format [$-409].

You can fix it manually as below:

df = pd.read_excel('out.xlsb', sheet_name='Canada Oil & Gas Split',
                   skiprows=6, engine='pyxlsb')

df['DATE'] = pd.to_datetime('1899-12-31') + df['DATE'].sub(1).apply(pd.offsets.Day)
print(df.head())

# Output
        DATE  OIL  GAS  MISC  TOTAL       Oil       Gas
0 2000-01-07  134  266     4    404  0.331683  0.658416
1 2000-01-14  169  342     6    517  0.326886  0.661509
2 2000-01-21  186  338     6    530  0.350943  0.637736
3 2000-01-28  171  381     6    558  0.306452  0.682796
4 2000-02-04  157  387     1    545  0.288073  0.710092

$-409

Corralien
  • 109,409
  • 8
  • 28
  • 52