
I'm importing xls files using gdata and converting the date columns with as.Date.

As per the manual for as.Date, the date origin is platform dependent, so I am determining which origin to use accordingly:

## pick the Excel date origin based on the platform R is running on
.origin <- ifelse(Sys.info()[['sysname']] == "Windows", "1899-12-30", "1904-01-01")
as.Date(myData$Date, origin=.origin)

However, I'm wondering if I should be considering the platform where the file is being read or the platform where it was written?

For what it's worth, I am currently testing the code on a Linux box with no Excel installed, and the correct dates are produced by using origin="1904-01-01".


Quoting `?as.Date`:

  ## date given as number of days since 1900-01-01 (a date in 1989)
  as.Date(32768, origin = "1900-01-01")
  ## Excel is said to use 1900-01-01 as day 1 (Windows default) or
  ## 1904-01-01 as day 0 (Mac default), but this is complicated by Excel
  ## treating 1900 as a leap year.
  ## So for dates (post-1901) from Windows Excel
  as.Date(35981, origin = "1899-12-30") # 1998-07-05
  ## and Mac Excel
  as.Date(34519, origin = "1904-01-01") # 1998-07-05
  ## (these values come from http://support.microsoft.com/kb/214330)
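
For reference, the two candidate origins are exactly 1,462 days apart (just over four years), which is why a date decoded with the wrong origin looks plausible but is shifted by about four years. A quick sanity check along these lines might help; this is only a sketch, where the serial 35981 and its true date are taken from the ?as.Date example quoted above and a known value from your own file would stand in for them:

## The two Excel origins differ by a constant 1,462 days, so decoding
## with the wrong one shifts every date by roughly four years.
as.Date("1904-01-01") - as.Date("1899-12-30")   # Time difference of 1462 days

## Check one serial number whose true date is known
## (35981 / 1998-07-05 is the Windows example quoted above).
known_serial <- 35981
known_date   <- as.Date("1998-07-05")
origins <- c(windows = "1899-12-30", mac = "1904-01-01")
sapply(origins, function(o) as.Date(known_serial, origin = o) == known_date)
## windows     mac
##    TRUE   FALSE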

Ricardo Saporta
  • Is there any way to import the date columns as character vectors, and then convert that? Might be easier... – Matt Parker Mar 28 '13 at 16:07
  • I'm pretty sure it is the system on which it is written. Also, those are the defaults for different systems, but it is possible to change that within a single Excel file as well. So this is not guaranteed to work in all cases. – Brian Diggs Mar 28 '13 at 16:08
    @MattParker, I had actually given that a go using `colClasses="character"` but the date still imports with the same value (except of course, now it is a string) – Ricardo Saporta Mar 28 '13 at 16:17
  • @BrianDiggs, thanks for pointing out that the excel defaults can be changed, I'll keep an eye out for that. For now, I am working off the assumption that I'm dealing with default settings. – Ricardo Saporta Mar 28 '13 at 16:18
  • @RicardoSaporta, have you tried the `xlsx` package? I just tested it with a file saved as "Excel 97-2003", using Excel 2010 @ Win7, and it correctly identified the origin as "1899-12-30". – Ferdinand.kraft Mar 28 '13 at 19:31
  • @Ferdinand.kraft, I have not tried it yet, but I will give it a shot now. To be clear, you saved it on a windows system and opened it on Linux or Mac? – Ricardo Saporta Mar 28 '13 at 19:37
  • I saved in excel 2010 and opened in R 2.15.2, both on Win7. But I can't believe the platform where the file is *read* is relevant. – Ferdinand.kraft Mar 28 '13 at 19:45
  • @Ferdinand.kraft, unfortunately, `xlsx::read.xlsx` is a no-go. It is (apparently) defaulting dates to the Windows setup. Conceivably, this is worse than leaving the dates in serial fashion, since if one is not aware of the conventions nor of which dates to expect, they may not notice that their dates are off by 4 years. Alternatively, `xlsx::read.xlsx2` does leave the date column as a serial. – Ricardo Saporta Mar 28 '13 at 20:02

2 Answers


You could try out the (extremely) new exell package: https://github.com/hadley/exell. It loads Excel dates into POSIXct, correctly choosing the origin based on whether the file was written by Windows or Mac Excel.
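
For illustration, a minimal sketch of that workflow, written against the package's later incarnation, readxl, whose read_excel() call is assumed here (the original exell function names may differ, and the file name is made up):

library(readxl)   # successor to the exell package; install.packages("readxl")

## read_excel() returns date cells as POSIXct, taking the 1900 or 1904
## origin from the workbook itself rather than from the platform R runs on.
myData <- read_excel("myData.xls")   # hypothetical file name
str(myData$Date)                     # POSIXct; no manual origin needed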

hadley

Yes, you should be considering where the file was written. Excel on Windows appears capable of distinguishing Mac-written dates from Windows-written dates, but you are getting evidence that these are Mac-originated .xls files.

The safest method would be to work within the version of Excel in which the data was entered, use the Format menu to format the date column as a date with a custom format of yyyy-mm-dd, then save as a csv file; you would then be able to import into R with "Date" in the proper column position of the colClasses vector. But it sounds as though that option is not available to you.
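
If that route were open, the R side would be a one-liner. A sketch, with a made-up file name and column layout:

## Sketch: after the dates are saved from Excel as literal yyyy-mm-dd text,
## read.csv can build Date columns directly via colClasses.
## "myData.csv" and the column order here are invented for illustration.
myData <- read.csv("myData.csv",
                   colClasses = c("character", "Date", "numeric"))
str(myData$Date)   # already class "Date"; no origin bookkeeping required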

I suppose it doesn't apply to you on a Linux box, so this is just a Mac whine: the gdata package gives deprecation warnings and then fails to install the XLSX support files on R 3.0.0 with the ordinary Perl 5.8 installation in '/opt/local/bin/perl', despite `gdata::findPerl` being able to find it successfully.

At this point I think the question should be redirected to asking whether you could coax the gdata functions into inspecting the properties of the files. After looking at the codebase for xls reading, I rather doubt it, since I do not see any mention of inspecting for different xls versions.

Looking with a text editor near the end of a blank xls file created with a Mac version of Excel, I see:

Worksheets˛ˇˇˇˇˇ ¿F$Microsoft Excel 97 - 2004 Worksheet˛ˇˇˇ8FIBExcel.Sheet.8˛ˇ
‡ÖüÚ˘Oh´ë+'≥Ÿ0îHPhħ
∞ºƒ'David WinsemiusDavid WinsemiusMicrosoft Macintosh Excel@ê˚á!Ë+Œ@ê'å-Ë+ŒG»˛ˇˇˇPICT¿Kġ

The other difference was that the Windows version, inspected the same way, had "Excel 2003 Worksheet" as the type of worksheet, whereas it was "Excel 97 - 2004" for the Mac version. So maybe you can coerce R into bypassing all the errors that get triggered when reading or grepping while scanning for "Macintosh"; maybe Linux-R is more resistant to that sort of thing? On my Mac, reading the file as text stops with:

Error: invalid multibyte string at '<ff>'

I also got a bunch of warnings from grep that suggested I might not be able to "see" into some of the strings:

Warning message:
In grep("Macintosh", lin) : input string 1 is invalid in this locale

You might be able to hijack some more robust handling from the Perl script xls2csv.pl that is part of the gdata package.

IRTFM
  • Interesting, thank you David. If it is dependent on where the `xls` file was written, presumably the file itself should have some indication of its origin. I'll dig around for that. – Ricardo Saporta Mar 28 '13 at 19:05
  • And yes, you are correct, accessing the original source is not available to me – Ricardo Saporta Mar 28 '13 at 19:06