1

I am importing data from SPSS (sav file). In this data set there are several date columns, several of which contain cells with missing values. All date columns are imported as date variables into R, so far so good. All date entries from SPSS are imported correctly into R, still so far so good. However, some of the missing values in SPSS are interpreted as 1582-10-14 when imported to R. I understand this is the start of the Gregorian calendar, so I thought maybe some of the cells in SPSS are 0 instead of NA, but when checking said cells, they are indeed empty, and look exactly the same as other empty cells which are correctly imported as NA in R.

I am sorry I can't think of a reproducible example and I can't put my SPSS file here, integrityreasons.

I have tried importing using rio (import_list) and haven(read_sav), and the same cells are interpreted as 1582-10-14 when using both methods. Maybe it is not random and there is some hidden data in the SPSS file not visible in the SPSS interface?

Any idea why I get this error? Could it be a bug? My experience though is that it seldom is. Has anyone else experienced this?

A possible fix would be just to remove all 1582-10-14 entries in R but I would like know the reason why it occurred. What if more data imported from SPSS is incorrect?

Edit: This post has been suggested as a solution, which it is not. That post concerns R importing dates from SPSS as absolute numbers in seconds from 1582-10-14 and how to convert these values to actual dates, if I understand correctly. This post concerns R incorrectly interpreting some of the missing values of an SPSS date column as 1582-10-14.

Here is a screenshot of how the data looks in SPSS and a dput() of how it looks after importing to R.

Screenshot of how the data in SPSS

Link to SPSS dataset

datetest <- structure(list(datevariable = structure(c(NA, 18394, 
 NA, 17914, -141428, -141428, NA, -141428, NA, NA, NA, 
 -141428, NA, NA, -141428, -141428, NA, NA, NA, NA, NA, 
 NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 17765, NA, NA, 
 NA, NA, NA, NA, NA, -141428, NA, NA, NA, NA, NA, NA, NA,
 19278, NA, NA, NA, NA, NA, NA, NA, 19121, NA, NA, NA, 
 NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
 NA, NA, NA, NA), label = "datevariable", class = "Date", format.spss = "SDATE10")),
     class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -77L))

As you can see row numbers 2 and 4 are imported correctly. The missing values of row 5,6,8,12,15... are imported as 1582-10-14.

zx8754
  • 52,746
  • 12
  • 114
  • 209
Max
  • 91
  • 7
  • If you have access to SPSS, can you create a version of your SPSS data set that consists *only* of a date column with examples of bad-missing, good-missing, and non-missing cells (hence should be anonymous/non-identifiable/shareable)? This is going to be very hard to debug without a reproducible example. – Ben Bolker Nov 02 '22 at 16:45
  • 1
    Here is a prior post addressing this for you: https://stackoverflow.com/questions/37880975/spss-date-format-when-imported-into-r – sconfluentus Nov 02 '22 at 17:00
  • @BenBolker , Excellent idea. I have managed to create a SPSS file with only one date variable but I can't upload files to stackoverflow, will try to upload somewhere else and link. In the meantime, I have edit the question with a screenshot from SPSS and dput of the imported df in R to highlight the issue – Max Nov 02 '22 at 20:42
  • @sconfluentus , This post does not adress the same issue and the suggested solution does not work for me. I have clarified this in the question. – Max Nov 02 '22 at 20:43
  • This is a very useful edit, but the screenshot really doesn't tell us much that we didn't already know; if you can link this toy data set somewhere that would be great. This looks like a possibility: https://easyupload.io/ (link will only be good for 30 days but that seems OK) – Ben Bolker Nov 02 '22 at 21:02
  • @BenBolker Perfect. I uploaded and here is the link: https://easyupload.io/ojb4rs – Max Nov 02 '22 at 21:37

1 Answers1

1

I have an incomplete and possibly unsatisfying answer.

I downloaded your file and ran the following code:

library(foreign)
library(rio)
x1 <- read.spss("spss_test.sav")
x2 <- rio::import("spss_test.sav")
x3 <- haven::read_spss("spss_test.sav")

In fact, rio is just a wrapper for the code in the haven or (alternatively, if requested) foreign packages, so x2 is redundant ... I inspected the results, using the print_hex() function defined below, and viewing the file in "hexl-mode" in Emacs (I imagine you can find other hex-editors online etc.). The advantage of comparing foreign::read.spss() and haven::read_spss() is that read.spss actually does less processing of the data (producing weird long numbers that can then be translated into dates), which makes it a little easier to see what's going on.

Looking at the results, comparing the first few elements of x1[[i]] (numeric value); print_hex(x1[[i]]) (underlying hex representation); and x3[[i]] (date value), we get

  • elements 1 and 3: NA, 7ff00000000007a2, NA
  • element 2: 13808620800, 4209b876a8000000, 2020-05-12
  • element 4: 13767148800 , 4209a4b028000000, 2019-01-18
  • element 5: 0, 0, 1582-10-14

So it's definitely the case that some of the missing values are being encoded by SPSS as zero and some as an NA value. Unfortunately, why the missing values are being encoded differently is a mystery to me.

I tried looking at the raw hex code dump, but it's painful and doesn't help very much. At the end of line 2 below, you can see the hex value corresponding to element 2 coded as 00 0000 a876 b809 42, which is in backwards reversed order, i.e. start with the last pair (42) and read each pair going right-to-left (09, b8, 76 ...). The value for element 4 seems to be right after it (end of line 2, beginning of line 3). This is about where I get lost, because I don't know what kind of compression/other magic is being done in the file encoding ...

01           0000 5554 462d 38e7 0300 0000 0000 00ff  ..UTF-8.........
02 000001f0: fdff fd64 64ff 6400 0000 a876 b809 4200  ...dd.d....v..B.
03 00000200: 0000 28b0 a409 42ff ffff 64ff ff64 64ff  ..(...B...d..dd.
04 00000210: ffff ffff ffff ffff ffff ffff ffff fd00  ................
05 00000220: 0000 ac8c 9e09 42ff ffff ffff ffff 64ff  ......B.......d.

From Double precision (64-bit) representation of numeric value in R (sign, exponent, significand) :

Rcpp::cppFunction('void print_hex(double x) {
    uint64_t y;
    static_assert(sizeof x == sizeof y, "Size does not match!");
    std::memcpy(&y, &x, sizeof y);
    Rcpp::Rcout << std::hex << y << std::endl;
}', plugins = "cpp11", includes = "#include <cstdint>")
Ben Bolker
  • 211,554
  • 25
  • 370
  • 453
  • Thank you for your effort. I have to admit I do not understand everything, but it seems then that the problem lies in the spss file. Next time I have acces to spss I will try to run some descriptive statistics to see whether the error appears there as well. – Max Nov 03 '22 at 20:15