
Recently I have switched from STATA to R.

In STATA, there is something called a value label. The encode command, for example, turns a string variable into a numeric one, with a string label attached to each number. Since string variables usually contain names that repeat many times, value labels save a lot of space when dealing with large datasets.

Unfortunately, I did not manage to find a similar command in R. The only package I have found that can attach labels to a vector of values is sjlabelled. It does attach them, but when I try to merge the labelled numeric vector into another data frame, the labels seem to “fall off”.

Example: Start with a string variable.

paragraph <- "Melanija Knavs was born in Novo Mesto, and grew up in Sevnica, in the Yugoslav republic of Slovenia. She worked as a fashion model through agencies in Milan and Paris, later moving to New York City in 1996. Her modeling career was associated with Irene Marie Models and Trump Model Management"
install.packages("sjlabelled")
library(sjlabelled)
sentences <- strsplit(paragraph, " ")
sentences <- unlist(sentences, use.names = FALSE)
# Now we have a vector of string values.
sentrnces_df <- as.data.frame(sentences)
sentences       <- unique(sentrnces_df$sentences)
group_sentences <- c(1:length(sentences))
sentences       <- as.data.frame(sentences)
group_sentences <- as.data.frame(group_sentences)
z <- cbind(sentences,group_sentences)
z$group_sentences <- set_labels(z$group_sentences, labels = (z$sentences))
sentrnces_df <- merge(sentrnces_df, z, by = c('sentences'))
get_labels(z$group_sentences)       # the labels I was attaching using set labels
get_labels(sentrnces_df$group_sentences) # the output is just “NULL”

Thanks!

P.S. Sorry about the inelegant code; as I said before, I'm pretty new to R.

Marco
David Harar
  • Are labels the same as factors in R? See e.g. `?factor` and https://www.stat.berkeley.edu/~s133/factors.html, and `str(sentrnces_df$sentences)`. – Otto Kässi Oct 12 '18 at 11:18
  • It appears that saving strings as factors no longer results in a memory gain ... – Wimpel Oct 12 '18 at 11:22

2 Answers


source: https://simplystatistics.org/2015/07/24/stringsasfactors-an-unauthorized-biography/

... Around June of 2007, R introduced hashing of CHARSXP elements in the underlying C code thanks to Seth Falcon. What this meant was that effectively, character strings were hashed to an integer representation and stored in a global table in R. Anytime a given string was needed in R, it could be referenced by its underlying integer. This effectively put in place, globally, the factor encoding behavior of strings from before. Once this was implemented, there was little to be gained from an efficiency standpoint by encoding character variables as factor. Of course, you still needed to use ‘factors’ for the modeling functions. ...
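The factor mentioned at the end of the quote is probably the closest base R analogue to a STATA value label: each element is stored as an integer code, and the distinct strings are kept once as levels. A minimal sketch, using a made-up `cities` vector that is not part of the question's data:

```r
# Hypothetical example data (not from the question).
cities <- c("Milan", "Paris", "Milan", "New York")

# factor() stores one integer code per element plus the distinct strings.
f <- factor(cities)

levels(f)        # the distinct strings, kept once
as.integer(f)    # the integer code behind each element
as.character(f)  # recovers the original strings from codes and levels
```

Because the levels are stored as an attribute of a classed object, they normally survive subsetting and `merge()`, which may make a factor a simpler choice than a separately labelled numeric vector.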

Wimpel

I adjusted your initial test data a little. I was confused by the many strings and am not sure they are necessary for this issue. Let me know if I missed the point. Here is my adjusted example and the answer:

#####################################
# initial problem rephrased
#####################################

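# create test data
library(sjlabelled)  # provides set_labels() and get_labels()
set.seed(1)          # make the random samples reproducible
id <- 1:20
variable1 <- sample(30:35, 20, replace = TRUE)
variable2 <- sample(36:40, 20, replace = TRUE)
df1 <- data.frame(id, variable1)
df2 <- data.frame(id, variable2)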

# set arbitrary labels
df1$variable1 <- set_labels(df1$variable1, labels = c("few" = 1, "lots" = 5))

# show labels in this frame
get_labels(df1)

# include associated values
get_labels(df1, values = "as.prefix")

# merge df1 and df2
df_merge <- merge(df1, df2, by = c('id'))

# labels lost after merge
get_labels(df_merge, values = "as.prefix")

#####################################
# solution with dplyr 
#####################################
library(dplyr)
df_merge2 <- left_join(x = df1, y = df2, by = "id")
get_labels(df_merge2, values = "as.prefix")

Solution attributed to:

Merging and keeping variable labels in R
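If adding dplyr is not an option, one possible base R workaround is to copy the attribute back onto the column after `merge()`. This is only a sketch: it assumes the labels live in a `labels` attribute on the column (the haven-style convention), and it sets that attribute by hand on toy data rather than through sjlabelled.

```r
# Hypothetical toy data; the "labels" attribute is set manually here and
# stands in for what a labelling package would attach (an assumption).
df1 <- data.frame(id = 1:3, variable1 = c(1, 5, 1))
attr(df1$variable1, "labels") <- c(few = 1, lots = 5)
df2 <- data.frame(id = 1:3, variable2 = c(36, 40, 38))

# Base R merge, as in the rephrased problem above.
df_merge <- merge(df1, df2, by = "id")

# Copy the attribute from the source column back onto the merged column.
attr(df_merge$variable1, "labels") <- attr(df1$variable1, "labels")

# Inspect the restored attribute.
attr(df_merge$variable1, "labels")
```

With several labelled columns, the same assignment could be repeated in a loop over the column names shared by the source and the merged data frame.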

Marco