In Stata I am able to use the codebookout
command to create an Excel workbook that saves name, label, and storage type of all the variables in the existing dataset with their corresponding values and value labels.
I would like to find an equivalent function in R. So far, I've encountered the memisc
library which has a function called codebook
, but it does not do the same thing as in Stata.
For example, In Stata, the output of the codebook would look like this...(see below - this is what I want)
Variable Name Variable Label Answer Label Answer Code Variable Type
hhid hhid Open ended String
inter_month inter_month Open ended long
year year Open ended long
org_unit org_unit long
Balaka 1
Blantyre 2
Chikwawa 3
Chiradzulu 4
i.e. each column in the data frame is evaluated to produce values for 5 different columns:
- Variable Name which is the name of the column
- Variable Label which is the name of the column
- Answer Label which is the unique values in the column. If there are no unique values, it is considered open ended
- Answer Code which is the numerical assignment to each category in the Answer Label. Blank if the Answer Label is not categorical.
- Variable Type: int, str, long (date)...
Here is my attempt:
CreateCodebook <- function(dF){
numbercols <- length(colnames(dF))
table <- data.frame()
for (i in 1:length(colnames(dF))){
AnswerCode <- if (sapply(dF, is.factor)[i]) 1:nrow(unique(dF[i])) else ""
AnswerLabel <- if (sapply(dF, is.factor)[i]) unique(dF[order(dF[i]),][i]) else "Open ended"
VariableName <- if (length(AnswerCode) - 1 > 1) c(colnames(dF)[i],
rep("",length(AnswerCode) - 1)) else colnames(dF)[i]
VariableLabel <- if (length(AnswerCode) - 1 > 1) c(colnames(dF)[i],
rep("",length(AnswerCode) - 1)) else colnames(dF)[i]
VariableType <- if (length(AnswerCode) - 1 > 1) c(sapply(dF, class)[i],
rep("",length(AnswerCode) - 1)) else sapply(dF, class)[i]
df = data.frame(VariableName, VariableLabel, AnswerLabel, AnswerCode, VariableType)
names(df) <- c("Variable Name", "Variable Label", "Variable Type", "Answer Code", "Answer Label")
table <- rbind(table, df)
}
return(table)
}
Unfortunately, I am getting the following warning message:
Warning messages:
1: In `[<-.factor`(`*tmp*`, ri, value = 1:3) :
invalid factor level, NA generated
2: In `[<-.factor`(`*tmp*`, ri, value = 1:2) :
invalid factor level, NA generated
The output I produce results in the Answer Code label getting messed up:
Variable Name Variable Label Variable Type Answer Code Answer Label
hhid hhid hhid Open ended character
month month month Open ended integer
year year year Open ended integer
org_unit org_unit org_unit Open ended character
v000 v000 v000 Open ended character
v001 v001 v001 Open ended integer
v002 v002 v002 Open ended integer
v003 v003 v003 Open ended integer
v005 v005 v005 Open ended integer
v006 v006 v006 Open ended integer
v007 v007 v007 Open ended integer
v021 v021 v021 Open ended numeric
2285 v024 v024 central <NA> factor
1 north <NA>
7119 south <NA>
11 v025 v025 rural <NA> factor
1048 v025 v025 urban <NA> factor
district_name district_name district_name Open ended character
coords_x1 coords_x1 coords_x1 Open ended numeric
coords_x2 coords_x2 coords_x2 Open ended numeric
itn_color itn_color itn_color Open ended numeric
piped piped piped Open ended numeric
sanit sanit sanit Open ended numeric
sanit_cd sanit_cd sanit_cd Open ended numeric
water water water Open ended numeric