I use R to analyze large data files from IPUMS, which publishes sophisticated microdata on Census records. IPUMS offers its extracts as SPSS, SAS or Stata files. To get the data into R, I've had the most luck downloading the SPSS version and using the read.spss function from the "foreign" library:

library(foreign);
ipums <- read.spss("usa_00106.sav", to.data.frame = TRUE);

This works brilliantly, save for this perpetual warning:

Warning message:
In `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels) else paste0(labels,  :
  duplicated levels in factors are deprecated

(If anyone is feeling heroic, I uploaded the zipped .sav file here (39 MB) as well as the .SPS file and the more human-readable codebook. This is just a sample IPUMS extract and, like all IPUMS data, contains no private information.)

My question is whether my data is compromised by duplicate factors in the SPSS file or whether this is something I can fix after the import.

To figure out which of the columns was the culprit, I wrote a little diagnosis:

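ipums <- read.spss("usa_00106.sav", to.data.frame = TRUE)
for (name in names(ipums)) {
  # is.factor() is safe even when class() returns more than one value
  if (is.factor(ipums[[name]])) {
    print(name)
    # index of the first duplicated level, or 0 if there is none
    print(anyDuplicated(levels(ipums[[name]])))
  }
}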

This loop correctly identifies the column BPLD as the culprit. That's a detailed version of a person's birthplace that has 536 possible values in the .SPS file, as confirmed by this code:

fac <- levels(ipums$BPLD)
length(fac)           #536
anyDuplicated(fac)    #153
fac[153]              #"Br. Virgin Islands, ns"

When I look at the .SPS file, I do see in fact that there are two entries for this location:

26052   "Br. Virgin Islands, ns"
26069   "Br. Virgin Islands, ns"
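A quick way to list every label that is shared by more than one code is sketched below. The `value_labels` vector is a hypothetical excerpt typed in by hand: the two "Br. Virgin Islands, ns" codes come from the .SPS lines above, while the "Example place" entry and its code are made up for illustration.

```r
# Hypothetical excerpt of a value-label table: names are labels, values are codes.
# "Example place" / 99999 is an invented entry, not taken from the codebook.
value_labels <- c("Example place"          = 99999L,
                  "Br. Virgin Islands, ns" = 26052L,
                  "Br. Virgin Islands, ns" = 26069L)

labs <- names(value_labels)

# TRUE for every entry whose label occurs more than once
is_shared <- labs %in% labs[duplicated(labs)]

# Codes that share a label with at least one other code
shared <- value_labels[is_shared]
print(shared)

# One possible disambiguation: append the code to each shared label
unique_labs <- ifelse(is_shared, paste0(labs, " (", value_labels, ")"), labs)
print(unique_labs)
```

Appending the code is only one choice; anything that makes the labels unique while keeping the original code recoverable would serve the same purpose.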

However, I don't see a single instance of this location in the data:

NROW(subset(ipums, ipums$BPLD=="Br. Virgin Islands, ns"))    #0

This may well be because this is not a common location that's likely to show up in the data, but I cannot assume that will always be the case in future projects. So part two of my question is whether an SPSS file with duplicate factors will at least import the correct values, or whether a file that produces this warning message is potentially damaged.

As for fixing the problem, I see a few related StackOverflow posts, like this one, but I'm not sure if they address the problem I have with complex public data from a third party. What is the most efficient way for me to clean up factors with duplicate values so that I can have full confidence in the data?

Chris Wilson

2 Answers

SPSS does not require uniqueness of value labels. In this dataset, BPLD is a string. I believe read.spss will create a factor with duplicate levels but will assign all the duplicate values to just one of them. You can use droplevels() after reading the data to get rid of the unused level.
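The two behaviours described here can be tried on a toy factor without the .sav file. This is a sketch in base R only, and it assumes a current R version, where assigning repeated labels through `levels<-` merges the affected levels; it does not show what read.spss does internally. The codes and the "Ohio" label are illustrative stand-ins.

```r
# Toy factor whose levels are codes (sorted as text: "26052" "26069" "39000")
codes <- factor(c("26052", "39000", "26069"))

# Giving two codes the same label merges them into a single level
labelled <- codes
levels(labelled) <- c("Br. Virgin Islands, ns", "Br. Virgin Islands, ns", "Ohio")
print(levels(labelled))        # two levels remain
print(as.character(labelled))  # rows with code 26052 and 26069 are now indistinguishable

# droplevels() removes levels that never occur, leaving the values untouched
bpl <- factor(c("Ohio", "Iowa", "Ohio"),
              levels = c("Ohio", "Iowa", "Br. Virgin Islands, ns"))
bpl_clean <- droplevels(bpl)
print(levels(bpl_clean))
```

The values survive the merge, but the distinction between the two original codes does not, which matters only if an analysis needs to tell them apart.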

JKP
  • The doc for read.spss says "This was orignally written in 2000 and has limited support for changes in SPSS formats since (which have not been many)." There have been substantial changes in the file format over time. These were done in a backwards-compatible fashion, but be careful reading files with long variable names (> 8 bytes), long strings (> 255 bytes), Unicode encoding (the default since Statistics 21), and extended dataset or variable metadata via custom attributes, among other properties. Note also that you can create an R file directly from Statistics using the R plugin. – JKP Dec 06 '16 at 15:55
  • Gotcha, thanks! I'm going to find an extract with duplications that actually show up in the data itself and test droplevels() and will report back. – Chris Wilson Dec 06 '16 at 16:37

Could you try importing and specifying factors as false with either:

# haven't tested
read.spss(x...,stringsAsFactors=FALSE)

or, from the help page for read.spss:

read.spss(x...,use.value.labels=FALSE)


?read.spss

# use.value.labels
# logical: convert variables with value labels into R factors with those levels?
# This is only done if there are at least as many labels as values of the
# variable (when values without a matching label are returned as NA).
Collier
  • Thanks for the response! Unfortunately it does not accept "stringsAsFactors=FALSE." And while it would technically work to say "use.value.labels=FALSE," I then lose all the information about each row. Instead of a state, for example, I just get a number representing that state. Same goes for every categorical column. – Chris Wilson Dec 06 '16 at 05:48
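A possible middle ground is to import the raw codes and rebuild the factor afterwards with labels that have been made unique. The sketch below assumes that a column read with use.value.labels=FALSE carries a "value.labels" attribute (a named vector, label names mapped to codes); that attribute is simulated by hand here rather than read from the .sav file, and the "Ohio" / 39000 entry is illustrative.

```r
# Simulated column of raw codes with a hand-made "value.labels" attribute
# (assumption: this mimics what read.spss attaches when use.value.labels = FALSE)
codes <- c(26052L, 39000L, 26069L, 39000L)
attr(codes, "value.labels") <- c("Ohio"                   = 39000L,
                                 "Br. Virgin Islands, ns" = 26052L,
                                 "Br. Virgin Islands, ns" = 26069L)

vl <- attr(codes, "value.labels")

# make.unique() appends a suffix to repeated labels so that every code keeps its own level
labs <- make.unique(names(vl), sep = "_")

# Rebuild the factor: one level per code, no duplicated labels
bpld <- factor(codes, levels = vl, labels = labs)
print(levels(bpld))
print(as.character(bpld))
```

Because every code keeps its own level, the original numeric code can still be recovered from the label if it is needed later.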