0

I am trying to convert an xlsx file to one CSV file containing the header and another CSV file containing the actual data. I have the following requirements:

  1. Header does not start at first row but at row start_line.
  2. Dates should not be considered as floats but in some string format.
  3. I don't know the total number of rows or columns of the file beforehand. I also don't want to specify which column is a date.

Using pandas I get stuck at number 1. I wanted to achieve this in two separate reads where I read from start_line to start_line+1 and from start_line+1 to the end. However it seems like it is not possible to read n lines from an offset. Below is the code I use to just get one file including the header.

import pandas as pd
def parse_excel(file,start_line,sheet,table):
    sh = pd.read_excel(file,sheet,skiprows=start_line)
    sh.to_csv("output.csv",sep='\t',encoding='utf-8',index=False)

Next I have tried this using xlrd but this library treats all dates as floats like in Excel. The only workaround here seems to go through all individual cells which does not seem very efficient or well coded. What I have now:

import xlrd
def parse_excel(file,start_line,sheet,table):
    with xlrd.open_workbook(file) as wb:
        sh = wb.sheet_by_name(sheet)
        header_written = False
        with open('{0}.csv'.format(table),'wb') as csv_file:
            wr = csv.writer(csv_file,delimiter='\t')
            for rownum in range(sh.nrows):
                if not header_written and start_line == rownum:
                    with open('{0}_header.csv'.format(table),'wb') as header:
                        hwr = csv.writer(header,delimiter='\t')
                        hwr.writerow(sh.row_values(rownum))
                        header_written = True
                elif header_written:
                    wr.writerow(sh.row_values(rownum))

Please point me to other solutions/libraries, show a workaround for either one of the above or explain why I should go for the xlrd workaround checking each individual cell.

Community
  • 1
  • 1
spijs
  • 1,489
  • 18
  • 36
  • You should be able to use the `header` parameter to set the line number for the column names. Combine this with `skip_rows` and I think that should take care of number 1. Use `parse_dates` to pass a list of the columns that are dates. To convert a numeric to date, I believe you will need to do this in a second line of code with `pd.to_datetime` and setting the `unit` parameter. – Ted Petrou Mar 31 '17 at 14:23
  • @TedPetrou `pandas` in fact treats the date exactly as I want it :). Can you show me how i can use the `header` parameter and extract the header? That would indeed solve my issue. – spijs Mar 31 '17 at 14:25
  • Use an integer for the row number where your column names reside. See the docs for more detail: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html – Ted Petrou Mar 31 '17 at 14:28
  • @TedPetrou I am quite unfamiliar with pandas. How should I then write the header to csv? Or will my output only contain the header when I use `to_csv`? – spijs Mar 31 '17 at 14:30

1 Answers1

1

As long as all of your data is below your header row then following should work. Assuming the header row is at row n (indexing beginning at 0 not 1 like excel).

df = pd.read_excel('filepath', header=n)
df.head(0).to_csv('header.csv', index=False)
df.to_csv('output.csv', header=None, index=False)
Ted Petrou
  • 59,042
  • 19
  • 131
  • 136