19

The xlsx package is reading dates in wrongly. I've read all the top similar Q's here and had a scout round the internet but I can't find this particular behaviour where the origin changes if there's non-date data in a column.

I have a tiny Excel spreadsheet you can get from dropbox:

https://www.dropbox.com/s/872q9mzb5uzukws/test.xlsx

It has three rows, two columns. First is a date, second is a number. The third row has "Grand Total" in the date column.

If I read in the first two rows with read.xlsx and tell it the first column is a date then this works:

read.xlsx("./test.xlsx",head=FALSE,1,colClasses=c("Date","integer"),endRow=2)
          X1 X2
1 2014-06-29 49
2 2014-06-30 46

Those are indeed the dates in the spreadsheet. If I try and read all three rows, something goes wrong:

read.xlsx("./test.xlsx",head=FALSE,1,colClasses=c("Date","integer"))
          X1    X2
1 2084-06-30    49
2 2084-07-01    46
3       <NA> 89251
Warning message:
In as.POSIXlt.Date(x) : NAs introduced by coercion

If I try reading in as integers I get different integers:

> read.xlsx("./test.xlsx",head=FALSE,1,colClasses=c("integer","integer"),endRow=2)
     X1 X2
1 16250 49
2 16251 46
> read.xlsx("./test.xlsx",head=FALSE,1,colClasses=c("integer","integer"))
     X1    X2
1 41819    49
2 41820    46
3    NA 89251

The first integers are correctly converted using as.Date(s1$X1,origin="1970-01-01") (Unix epoch) and the second integers are correctly converted using as.Date(s2$X1, origin="1899-12-30") (Excel epoch). If I convert the second lot using 1970 I get the 2084 dates.

So: Am I doing something wrong? Is the best thing to read as integers, and if any NAs then convert using Excel epoch, otherwise use Unix epoch? Or is it a bug in the xlsx package?

xlsx version is Version: 0.5.1

MichaelChirico
  • 33,841
  • 14
  • 113
  • 198
Spacedman
  • 92,590
  • 12
  • 140
  • 224
  • I was about to recommend the `XLConnect` package, but that seems to have its own problems - I can't get it to read the *first* row: `readWorksheet(loadWorkbook("test.xlsx"),"Sheet1",startRow=0)`. Weird. – Stephan Kolassa Aug 06 '14 at 11:27
  • @StephanKolassa per default `readWorksheet` has set `header = TRUE`. – Beasterfield Aug 06 '14 at 11:42
  • It is almost certainly a bug in `xlxs::read.xlsx`. Note that if you specify `as.data.frame=FALSE` to `read.xlsx`, the in all 4 cases (with and without the third row and with specification of `"Date"` or `"integer"`), the numerical values are `41819` or `41820`. I'd file an issue with the maintainer. – Brian Diggs Aug 06 '14 at 15:17

3 Answers3

18

The dates can be read as integers and later converted to Date using openxlsx::convertToDate() function.

More here

Cristian E. Nuno
  • 2,822
  • 2
  • 19
  • 33
Ali
  • 7,810
  • 12
  • 42
  • 65
4

XLConnect is able to handle this pretty sweet:

test <- readWorksheetFromFile( "~/Downloads/test.xlsx", sheet = "Sheet1", header = FALSE )
test
                 Col1  Col2
1 2014-06-29 00:00:00    49
2 2014-06-30 00:00:00    46
3         Grand Total 89251

The Problem you have is obvious, that the first column is of mixed type: character and POSIXct. XLConnect is able to read each cell correctly in, but casts all cells of a column then to the most common type, which is character in this case.

str(test)
'data.frame':   3 obs. of  2 variables:
 $ Col1: chr  "2014-06-29 00:00:00" "2014-06-30 00:00:00" "Grand Total"
 $ Col2: num  49 46 89251
Beasterfield
  • 7,023
  • 2
  • 38
  • 47
  • 1
    yeah that is a problem with R in general, R does not have proper multidimensional list support. you can have listed lists, matrix, and data.frames, but all have their issues. – phonixor Aug 06 '14 at 11:56
  • 1
    adding `colTypes=c("Date","integer")` seems to do the correct conversion and sets the last item to `NA`. I don't really care about the text in the last item. – Spacedman Aug 06 '14 at 12:06
3

The problem you're having is that Excel stores the number of days since Jan-0-1900, and that is the number R is reading from the excel file. When you convert in R, you are converting based on the number of days since Jan-1-1970. If you subtract the number of days between those two first, it should work.

Tracy
  • 31
  • 1