
I have two dataframes in R:

  1. "Labels" contains (a) variable names and (b) descriptive variable labels;
  2. "Data" contains (a) the same variable names and (b) associated data, but no descriptive labels.

I would like to apply the descriptive labels from "labels" to the variables in "data", but I can't figure out how to do so. Since I have 400+ labels, typing them out manually would take quite a while.

My data looks like this (heavily simplified):

labels <- data.frame(names = c("age", "sex", "year"), labels=c("Age of Participant", "Sex of Participant","Year of Participation"))

data <- data.frame(age=c(12, 14, 16), sex=c(1, 0, 1), year=c(1998, 1997, 1994))

I tried both using the sjlabelled package and applying this technique (R: Assign variable labels of data frame columns) to my data, but I can't figure out how to make these tools apply in this situation.

Note that I am not merely trying to merge the datasets, but would like to apply Stata- or SPSS-like "variable labels" to my variables.

Thanks for your help! - New R User

Anna Jones
  • I voted to reopen as your edit suggests that it is not a duplicate. Unfortunately, it is 'not clear what you are asking' to me, since I have no idea what Stata-like variable labels are. Perhaps you can explain or show what you think the final result should look like. – Alex Aug 28 '19 at 03:48
  • Researching further: https://libguides.library.kent.edu/SPSS/DefineVariables I don't think R has what you want. – Alex Aug 28 '19 at 03:50
  • Or: https://stats.idre.ucla.edu/spss/modules/labeling-and-documenting-data/ This definitely seems like something that could be incorporated into tidy output of data but I don't think anything like this has been implemented. – Alex Aug 28 '19 at 03:55

1 Answer

It really depends on when you want to use your variable "labels". While doing your data analysis, you definitely want to keep your short, concise variable names, otherwise you end up in a scenario of

lm(Sex of Participant ~ `Year of Participation`, data=data)

which is not valid syntax, and a heck of a bother to type again and again and agian (whoops, typos!).

And when you've finished your analysis, your boss asks you to rename the age "label" to "Participant age", and there goes the analysis until you've searched and replaced every occurrence of the previous variable name.

So, the case should be clear for keeping concise variable names during coding (and you are not arguing against this in your question).

I am guessing you want variable labels for presentation. How to apply variable labels depends entirely on how you are presenting your data. I'll give a few examples.

Output to console:

> data
  age sex year
1  12   1 1998
2  14   0 1997
3  16   1 1994

In this case I would store the labels in a named vector, which also defines the order of the columns:

labels <- c(age='Age of participant', sex="Sex of Participant", year="Year of Participation")
present <- data[,names(labels)]
colnames(present) <- labels
> present
  Age of participant Sex of Participant Year of Participation
1                 12                  1                  1998
2                 14                  0                  1997
3                 16                  1                  1994
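Since the question starts from a "labels" data frame rather than a hand-typed vector, the named vector can be built from it instead. The following is a minimal sketch, assuming the column names from the question (names, labels); label_vec and keep are illustrative names, not part of the original answer:

```r
# Data as given in the question
labels_df <- data.frame(
  names  = c("age", "sex", "year"),
  labels = c("Age of Participant", "Sex of Participant", "Year of Participation"),
  stringsAsFactors = FALSE
)
data <- data.frame(age = c(12, 14, 16), sex = c(1, 0, 1), year = c(1998, 1997, 1994))

# Named vector: names are the variable names, values are the labels
label_vec <- setNames(as.character(labels_df$labels), as.character(labels_df$names))

# Keep only labels whose variable actually exists in the data
keep <- label_vec[names(label_vec) %in% names(data)]

# Presentation copy; the original 'data' keeps its short names
present <- data[, names(keep), drop = FALSE]
colnames(present) <- unname(keep)
print(present)
```

With 400+ labels this avoids typing anything by hand, and the analysis can keep using the short names in data.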

Plotting data:

plot(data[,c('age','year')])

Want to print proper labels? Use xlab and ylab:

plot(data[,c('age','year')], xlab='Age of participant', ylab='Year of participation')

Plotting data using ggplot2:

Again, the axis labels are polish and are applied separately:

library(ggplot2)
ggplot(data, aes(x=age, y=year)) + geom_point() + labs(x='Age of participant', y='Year of participation')

And if you wanted to make a really small plot, perhaps you would scoot in a newline (\n) to break the label into two lines.
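The newline does not have to be inserted by hand. As a rough sketch using only base R, strwrap() can break a label at a chosen width; wrap_label is a hypothetical helper introduced here for illustration, and the width of 12 is an arbitrary assumption:

```r
# Labels stored as a named vector, as in the console example
label_vec <- c(age = "Age of Participant", year = "Year of Participation")

# Hypothetical helper: wrap a label and join the pieces with newlines
wrap_label <- function(label, width = 12) {
  paste(strwrap(label, width = width), collapse = "\n")
}

x_lab <- wrap_label(label_vec[["age"]])
y_lab <- wrap_label(label_vec[["year"]])

# The wrapped strings can then be passed on as axis labels, e.g.
# plot(data$age, data$year, xlab = x_lab, ylab = y_lab)
cat(y_lab, "\n")
```

The same strings should work with labs() in ggplot2, since it also accepts plain character labels.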

Formatted tables using xtable:

This is actually the same approach as with "output to console".

Conclusion:

I hope I have convinced you why there is no trivial answer: variable labels "are not a thing" in R, because their application differs widely.

That said, the renaming example does support the case for having labels. There is, however, no structure that reliably carries this metadata throughout an R analysis, as many functions from hordes of packages routinely strip input data.frames of their attributes.
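To make the point about attributes concrete, here is a small sketch that attaches each label as a "label" attribute on the matching column and then shows how easily it is lost. The attribute name "label" is an assumption here; it is the convention used by several labelling packages, but base R itself attaches no meaning to it:

```r
labels_df <- data.frame(
  names  = c("age", "sex", "year"),
  labels = c("Age of Participant", "Sex of Participant", "Year of Participation"),
  stringsAsFactors = FALSE
)
data <- data.frame(age = c(12, 14, 16), sex = c(1, 0, 1), year = c(1998, 1997, 1994))

# Attach each label to the column of the same name, if it exists
for (i in seq_len(nrow(labels_df))) {
  v <- labels_df$names[i]
  if (v %in% names(data)) {
    attr(data[[v]], "label") <- labels_df$labels[i]
  }
}

# The label is stored on the column ...
print(attr(data$age, "label"))

# ... but plain vector subsetting already drops it
first_two <- data$age[1:2]
print(attr(first_two, "label"))
```

So the labels can be attached in bulk, but whether they survive depends on every function the data passes through afterwards.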

You are more than welcome to ask a new question here on Stack Overflow when you have a specific use case in mind for displaying variable labels.

MrGumble