I am trying to convert an xlsx file to one CSV file containing the header and another CSV file containing the actual data. I have the following requirements:
- Header does not start at first row but at row
start_line
. - Dates should not be considered as floats but in some string format.
- I don't know the total number of rows or columns of the file beforehand. I also don't want to specify which column is a date.
Using pandas
I get stuck at number 1.
I wanted to achieve this in two separate reads where I read from start_line to start_line+1
and from start_line+1
to the end.
However it seems like it is not possible to read n lines from an offset. Below is the code I use to just get one file including the header.
import pandas as pd
def parse_excel(file,start_line,sheet,table):
sh = pd.read_excel(file,sheet,skiprows=start_line)
sh.to_csv("output.csv",sep='\t',encoding='utf-8',index=False)
Next I have tried this using xlrd
but this library treats all dates as floats like in Excel. The only workaround here seems to go through all individual cells which does not seem very efficient or well coded. What I have now:
import xlrd
def parse_excel(file,start_line,sheet,table):
with xlrd.open_workbook(file) as wb:
sh = wb.sheet_by_name(sheet)
header_written = False
with open('{0}.csv'.format(table),'wb') as csv_file:
wr = csv.writer(csv_file,delimiter='\t')
for rownum in range(sh.nrows):
if not header_written and start_line == rownum:
with open('{0}_header.csv'.format(table),'wb') as header:
hwr = csv.writer(header,delimiter='\t')
hwr.writerow(sh.row_values(rownum))
header_written = True
elif header_written:
wr.writerow(sh.row_values(rownum))
Please point me to other solutions/libraries, show a workaround for either one of the above or explain why I should go for the xlrd
workaround checking each individual cell.