I use R to analyze large data files from IPUMS, which publishes detailed microdata from Census records. IPUMS offers its extracts as SPSS, SAS, or Stata files. To get the data into R, I've had the most luck downloading the SPSS version and using the read.spss
function from the "foreign" library:
library(foreign);
ipums <- read.spss("usa_00106.sav", to.data.frame = TRUE);
This works brilliantly, save for this perpetual warning:
Warning message:
In `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels) else paste0(labels, :
duplicated levels in factors are deprecated
(If anyone is feeling heroic, I uploaded the zipped .sav file here (39 MB) as well as the .SPS file and the more human-readable codebook. This is just a sample IPUMS extract and, like all IPUMS data, contains no private information.)
My question is whether my data is compromised by the duplicate factor levels in the SPSS file, or whether this is something I can fix after the import.
To figure out which of the columns was the culprit, I wrote a little diagnostic:
ipums <- read.spss("usa_00106.sav", to.data.frame = TRUE);
# print the name of every factor column whose levels contain duplicates
for (name in names(ipums)) {
  type <- class(ipums[[name]]);
  if (type == "factor") {
    print(name);
    print(anyDuplicated(levels(ipums[[name]])));
  }
}
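For what it's worth, the same check can be written more compactly; this is just a sketch that assumes the data frame has already been read in as above:
# names of all factor columns whose levels contain duplicates
dup_cols <- names(ipums)[sapply(ipums, function(col) is.factor(col) && anyDuplicated(levels(col)) > 0)]
dup_cols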
This loop correctly identifies the column BPLD as the culprit. That's the detailed version of a person's birthplace, which has 536 possible values in the .SPS file, as confirmed by this code:
fac <- levels(ipums$BPLD)
length(fac) #536
anyDuplicated(fac) #153
fac[153] #"Br. Virgin Islands, ns"
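Since anyDuplicated() returns the index of the first element that duplicates an earlier one, the positions of both identically labelled levels can be listed with a quick check on the fac vector from above:
which(fac == "Br. Virgin Islands, ns") # indices of both levels carrying this label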
When I look at the .SPS file, I do in fact see that there are two entries for this location:
26052 "Br. Virgin Islands, ns"
26069 "Br. Virgin Islands, ns"
However, I don't see a single instance of this location in the data:
NROW(subset(ipums, ipums$BPLD=="Br. Virgin Islands, ns")) #0
This may well be because this is an uncommon location that's unlikely to show up in the data, but I cannot assume that will always be the case in future projects. So part two of my question is whether an SPSS file with duplicate factor levels will at least import the correct values, or whether a file that produces this warning message is potentially damaged.
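One check I could imagine (just a sketch, assuming the two codes of interest are 26052 and 26069 as listed in the .SPS file) is to re-read the file with use.value.labels = FALSE, so that BPLD stays as raw numeric codes, and count how many rows carry either code:
ipums_codes <- read.spss("usa_00106.sav", to.data.frame = TRUE, use.value.labels = FALSE)
# rows whose underlying birthplace code is one of the two that share a label
sum(ipums_codes$BPLD %in% c(26052, 26069))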
As for fixing the problem, I see a few related Stack Overflow posts, like this one, but I'm not sure whether they address the problem I have with complex public data from a third party. What is the most efficient way for me to clean up factors with duplicate levels so that I can have full confidence in the data?
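The naive fix I can think of is to collapse each duplicated label onto its first occurrence after the import, along these lines (an untested sketch; I don't know whether it is safe for data like this or whether it's the most efficient route):
# rebuild every factor column so that duplicated level names collapse into one level
ipums[] <- lapply(ipums, function(col) {
  if (is.factor(col) && anyDuplicated(levels(col)) > 0) {
    factor(as.character(col), levels = unique(levels(col)))
  } else {
    col
  }
})
That keeps the level order intact and maps rows whose label was duplicated onto the single retained level, but it silently merges the two underlying codes, which is exactly the behaviour I'd like confirmation on.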