2

I'm working with some code that reads data from xlsx files by parsing the xml. It is all pretty straightforward, with the exception of date cell.

Dates are stored as integers and have an "s" attribute that is an index into the stylesheet, which can be used to get a date formatting string. Here are some examples from a previous stackoverflow question that is linked below:

19 = 'h:mm:ss AM/PM';

20 = 'h:mm';

21 = 'h:mm:ss';

22 = 'm/d/yy h:mm';

These are the built in date formatting strings from the ooxml standard, however it seems like excel tends to use custom formatted strings instead of the builtins. Here is an example format from an Excel 2007 spreadsheet. numFmtId greater than 164 is a custom format.

<numFmt formatCode="MM/DD/YY" numFmtId="165"/>

Determining if a cell should be formatted as a date is difficult because the only indicator I can find is the formatCode. This one is obviously a date, but cells could be formatted any number of ways. My initial attempt is to look for Ms, Ds, and Ys in the formatCode, but that seems problematic.

Has anybody had any luck with this problem? It seems like the standard excel reading libraries are lacking in xlsx support at this time. I've read through the standards and have dug through a lot of xlsx files without much luck.

The best information seems to come from this stackoverflow question:

what indicates an office open xml cell contains a date time value

Thanks!

Community
  • 1
  • 1
Edwin Knuth
  • 143
  • 1
  • 6

2 Answers2

5

Dates are stored as integers

In the Excel data model, there is really no such thing as an integer. Everything is a float. Dates and datetimes are floats, representing days and a fraction since a variable epoch. Times are fractions of a day.

It seems like the standard excel reading libraries are lacking in xlsx support at this time.

google("xlsxrd"). To keep up to date, join the python-excel group.

Edit I see that you have already asked a question there. If you had asked a question there as specific as this one, or responded to my request for clarification, you would have this info over two weeks ago.

Have a look at the xlrd documentation. Up the front there is a discussion on Excel dates. All of it applies to Excel 2007 as well as earlier versions. In particular: it is necessary to parse custom formats. It is necessary to have a table of "standard" format indexes which are for date formats. "Standard" formats listed in some places don't include the formats used in CJK locales.

Options for you:

(1) Borrow from the xlrd source code, including the xldate_as_tuple function.

(2) Option (1) + Get the xlsxrd bolt-on kit and borrow from its source code.

(3) [Recommended] Get the xlsxrd bolt-on kit and use it ... you get a set of APIs that operate across Excel versions 2.0 to 2007 and Python versions 2.1 to 2.7.

John Machin
  • 81,303
  • 11
  • 141
  • 189
  • Thanks for the info John. I would have followed up on my question, but I was not actually using xlrd so I didn't think it was the appropriate place. I'll check out xlsxrd. – Edwin Knuth Feb 11 '11 at 00:29
  • @Edwin - at the very least, looking at xlsxrd will give you a feel for the logic – Mark Baker Feb 11 '11 at 00:40
  • Also, I've got no trouble doing that math. The issue is recognizing the date formatted cells. – Edwin Knuth Feb 11 '11 at 01:38
  • Ok, thanks to John's xlsxrd code, I was able to solve this problem. xlrd has a method called xlrd.formatting.is_date_format_string() that takes a number format code and returns true if it is actually a date format string. The first argument is a workbook, but we only need a verbosity attribute from it. I set verbosity to zero by passing namedtuple('literal', 'verbosity')(verbosity=0) to the method. This allows the code to access book.verbosity == 0. I'm successfully using an older version of xlrd, fyi. Thanks everyone! – Edwin Knuth Feb 11 '11 at 17:52
0

It isn't enough simply to look for Ms, Ds, and Ys in the number format code

[Red]#,##0 ;[Yellow](#,##0)

is a perfectly valid number format, which contains both Y and D, but isn't a date format. I specifically test for any of the standard date/time formatting characters ('y', 'm', 'd', 'H', 'i', 's') that are outside of square braces ('[' ']'). Even then, I was finding that a few false positives were slipping through, mainly associated with accounting and currency formats. Because these typically begin with either an underscore ('_') or a space followed by a zero (' 0') (neither of which I've ever encountered in a date format, I explicitly filter these values out.

A part of my (PHP) code for determining if a format mask is a date or not:

private static  $possibleDateFormatCharacters = 'ymdHis';

//  Typically number, currency or accounting (or occasionally fraction) formats
if ((substr($pFormatCode,0,1) == '_') || (substr($pFormatCode,0,2) == '0 ')) {
    return false;
}
// Try checking for any of the date formatting characters that don't appear within square braces
if (preg_match('/(^|\])[^\[]*['.self::$possibleDateFormatCharacters.']/i',$pFormatCode)) {
    return true;
}

// No date...
return false;

I'm sure that there may still be exceptions that I'm missing, but (if so) they are probably extreme cases

Mark Baker
  • 209,507
  • 32
  • 346
  • 385
  • Hi Mark ... should ignore text constants in the format; example: `"yymmddhhmmss haha gotcha"General` – John Machin Feb 10 '11 at 09:50
  • @John - true enough... thanks for the "heads up"... (which would give a false positive). I'll add that test to the PHPExcel validator. As far as I'm aware, nobody has reported any false positives that are a result of quoted strings, but it's only a matter of time. Now, how to modify my regexp :( – Mark Baker Feb 10 '11 at 09:55
  • Yeah, I was specifically worried about the presence of d in Red :( – Edwin Knuth Feb 11 '11 at 00:33
  • What is this `i` format code? I can't find it in any documentation. My Excel 2003 and 2007 both treat it as a literal e.g. formatting 0.666666 with custom format `hh:ii:ss` produces `16:ii:00` ... locale is `English (Australia)`. – John Machin Feb 11 '11 at 19:23
  • @John - I'm sure there was a reason for the i at the time when I originally wrote it, but can't find any reference to it myself now. My vague recollection was that it was very locale specific, for example Hijri dates, but I've checked those and they doesn't use it... so I'm somewhat at a loss to explain – Mark Baker Feb 11 '11 at 21:30