3

I'm trying to convert a directory full of XLSX files to CSV. Everything is working except I'm encountering an issue with the columns containing time information. The XLSX file is being created by another program that I can not modify. But I want to maintain the same times that show up when I view the XLSX file in Excel as when it is converted to CSV and viewed in any text editor.

My code:

import csv
import xlrd
import os
import fnmatch
import Tkinter, tkFileDialog, tkMessageBox

def main():
    root = Tkinter.Tk()
    root.withdraw()
    print 'Starting .xslx to .csv conversion'
    directory = tkFileDialog.askdirectory()
    for fileName in os.listdir(directory):
        if fnmatch.fnmatch(fileName, '*.xlsx'):
            filePath = os.path.join(directory, fileName)
            saveFile = os.path.splitext(filePath)[0]+".csv"
            savePath = os.path.join(directory, saveFile)
            workbook = xlrd.open_workbook(filePath)
            sheet = workbook.sheet_by_index(0)
            csvOutput = open(savePath, 'wb')
            csvWriter = csv.writer(csvOutput, quoting=csv.QUOTE_ALL)
            for row in xrange(sheet.nrows):
                csvWriter.writerow(sheet.row_values(row))
            csvOutput.close()
    print '.csv conversion complete'

main()

To add some detail, if I open one file in Excel I see this in a time column:

00:10.3
00:14.2
00:16.1
00:20.0
00:22.0

But after I convert to CSV I see this in the same location:

0.000118981
0.000164005
0.000186227
0.000231597
0.000254861

Thanks to seanmhanson with his answer https://stackoverflow.com/a/25149562/1858351 I was able to figure out that Excel is dumping the times as decimals of a day. While I should try to learn and use xlrd better, for a quick short term fix I was instead able to convert that into seconds and then from seconds back into the time format originally seen of HH:MM:SS. My (probably ugly) code below in case anyone might be able to use it:

import csv
import xlrd
import os
import fnmatch
from decimal import Decimal
import Tkinter, tkFileDialog

def is_number(s):
    try:
        float(s)
        return True
    except ValueError:
        return False

def seconds_to_hms(seconds):
    input = Decimal(seconds)
    m, s = divmod(input, 60)
    h, m = divmod(m, 60)
    hm = "%02d:%02d:%02.2f" % (h, m, s)
    return hm

def main():
    root = Tkinter.Tk()
    root.withdraw()
    print 'Starting .xslx to .csv conversion'
    directory = tkFileDialog.askdirectory()
    for fileName in os.listdir(directory):
        if fnmatch.fnmatch(fileName, '*.xlsx'):
            filePath = os.path.join(directory, fileName)
            saveFile = os.path.splitext(filePath)[0]+".csv"
            savePath = os.path.join(directory, saveFile)
            workbook = xlrd.open_workbook(filePath)
            sheet = workbook.sheet_by_index(0)
            csvOutput = open(savePath, 'wb')
            csvWriter = csv.writer(csvOutput, quoting=csv.QUOTE_ALL)
            rowData = []
            for rownum in range(sheet.nrows):
                rows = sheet.row_values(rownum)
                for cell in rows:
                    if is_number(cell):
                        seconds = float(cell)*float(86400)
                        hms = seconds_to_hms(seconds)
                        rowData.append((hms))
                    else:
                        rowData.append((cell))
                csvWriter.writerow(rowData)
                rowData = []
            csvOutput.close()
    print '.csv conversion complete'

main()
Community
  • 1
  • 1
stoves
  • 778
  • 1
  • 11
  • 25
  • Have you tried to convert the time column to straight text before exporting it? – VanCowboy Aug 05 '14 at 22:32
  • @VanCowboy the issue was that excel was providing a formatted view of the underlying data. the raw data was really a decimal so I don't think converting to straight text would have solved my issue. the answer I selected below explains this better and I changed my code to address that above – stoves Aug 06 '14 at 17:31

1 Answers1

3

Excel stores time as a float in terms of days. You will need to use XLRD to determine if a cell is a date, and then convert it as needed. I'm not great with XLRD, but you might want something akin to this, changing the string formatting if you want to keep leading zeroes:

if cell.ctype == xlrd.XL_CELL_DATE:
    try: 
        cell_tuple = xldate_as_tuple(cell, 0)
        return "{hours}:{minutes}:{seconds}".format(
            hours=cell_tuple[3], minutes=cell_tuple[4], seconds=cell_tuple[5])
    except (any exceptions thrown by xldate_as_tuple):
        //exception handling

The XLRD date to tuple method's documentation can be found here: https://secure.simplistix.co.uk/svn/xlrd/trunk/xlrd/doc/xlrd.html?p=4966#xldate.xldate_as_tuple-function

For a similar issue already answered, see also this question: Python: xlrd discerning dates from floats

Community
  • 1
  • 1
seanmhanson
  • 219
  • 1
  • 6
  • 1
    Your answer helped me solve my problem so I'm marking it as such. I wound up going a different route that was a faster for me to implement (posted code above as such) but I should probably learn to use xlrd better for the long run. – stoves Aug 06 '14 at 17:27
  • Yeah, I agree that the method above is definitely not the best solution for your needs, (I'd just multiply and then format the number of seconds) but it's good to know it exists / there are methods you should check out for XLRD. If you did something else you should post it! – seanmhanson Aug 07 '14 at 18:25
  • xlrd 0.9.3 was released April 2014, and it contains the function `xldate.xldate_as_datetime`, which converts an Excel date directly to a Python datetime. Still, this is a relatively minor convenience, as it's easy enough to create a date, time, or datetime from a tuple. The main thing to know is that Python provides a method, [`strftime`](https://docs.python.org/2/library/datetime.html#strftime-and-strptime-behavior), which allows easy conversion to strings, in various formats. – John Y Aug 25 '14 at 22:03