
I'm facing the following problem when I try to run my regression model:

The two datasets are too big to post, so I will give a view of "merged_data":

merged_data

GemData <- read_dta("C:/Users/I/Documents/GEM Dataset.dta")

GlobeData <- read_excel("GLOBE-Phase-2-Aggregated-Societal-Culture-Data.xls")

> dput(head(reference_iso))
structure(list(name = c("Afghanistan", "Aland Islands", "Albania", 
"Algeria", "American Samoa", "Andorra"), alpha.3 = c("AFG", "ALA", 
"ALB", "DZA", "ASM", "AND")), row.names = c(NA, 6L), class = "data.frame")
> merged_data <- GlobeData %>% 
+   left_join(reference_iso, by = c('Country Name' = 'name')) %>% 
+   rename(iso3 = 'alpha.3') %>% 
+   left_join(GemData, by = c('iso3' = 'cntry') )
> model1 <- lm(all_high_stat_entre ~ Uncertainty Avoidance Societal Practices ,data=merged_data)
Error: unexpected symbol in "model1 <- lm(all_high_stat_entre ~ Uncertainty Avoidance"

Any advice on why this error appears?

Nikolas
  • Please do not post photos of data or code! If you do, people who are willing to help you would have to type out all that text. Instead provide a [minimal reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610#5963610) P.S. Here is [a good overview on how to ask a good question](https://stackoverflow.com/help/how-to-ask) – dario Oct 08 '21 at 14:49
  • Also, you need to either enclose your variable names that have spaces in them with backticks or if they are individual variables put operators between them (i.e. `all_high_stat_entre ~ Uncertainty + Avoidance + Societal + Practices`) – dario Oct 08 '21 at 14:51
  • Sir, I explained to you that the datasets are very, very big. – Nikolas Oct 08 '21 at 14:51
  • @dario no sir, the problem is that the names of the variables are big – Nikolas Oct 08 '21 at 14:52
  • Please check the link on MRE I gave you! A MRE is **by definition** not your whole data but only as much as is necessary to show the problem! Please use the information given to you – dario Oct 08 '21 at 14:53
  • @dario maybe that's the wrong part; you can see the names of the variables in the photo, please. Thanks – Nikolas Oct 08 '21 at 14:53
  • I know the MRE rules, sir, and I tried dput(), but the structure it shows isn't good. – Nikolas Oct 08 '21 at 14:54

1 Answer


As was mentioned, your "big variable names" cannot be referenced ad-hoc in a formula. While I don't know if this is right (pic of data does not include enough context), I suspect all you need to do is enclose all space-including variables in backticks, as in

model1 <- lm(all_high_stat_entre ~ `Uncertainty Avoidance Societal Practices`,
             data=merged_data)

Demonstration:

mt <- mtcars
names(mt)[2] <- "c yl"
head(mt, 3)
#      mpg  c yl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#    <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
# 1:  21.0     6   160   110  3.90 2.620 16.46     0     1     4     4
# 2:  21.0     6   160   110  3.90 2.875 17.02     0     1     4     4
# 3:  22.8     4   108    93  3.85 2.320 18.61     1     1     4     1

lm(mpg ~ c yl + disp, data = mt)
# Error: unexpected symbol in "lm(mpg ~ c yl"
# x

lm(mpg ~ `c yl` + disp, data = mt)
# Call:
# lm(formula = mpg ~ `c yl` + disp, data = mt)
# Coefficients:
# (Intercept)       `c yl`         disp  
#    34.66099     -1.58728     -0.02058  

Why?

Think of this from a language-parsing viewpoint: "tokens" that are literal numbers, variables, or functions must be delimited by something. In most cases, this needs to be an infix operator, a paren, or a comma.

Examples:

  • c(1 2) does not work since we want 1 and 2 to be distinct, so we use a comma.

  • mean 2 should be mean(2), where the paren separates them. We can optionally include spaces here, mean (2) and mean( 2) work just fine, so the spaces here are ignored.

  • if we have two variables x and y, then we can do x + y or x+y, where the infix + clearly/obviously separates them.
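The examples above can be checked without evaluating anything, since these are failures of the parser and not of the functions involved. A minimal sketch, assuming a small helper named parses() (hypothetical, not part of base R) that wraps parse(text = ...):

```r
# Hypothetical helper: TRUE if `code` is syntactically valid R.
# parse() only checks syntax; nothing is evaluated, so x, y and
# mean do not need to exist or be called.
parses <- function(code) {
  tryCatch({
    parse(text = code)
    TRUE
  }, error = function(e) FALSE)
}

snippets <- c(
  "c(1 2)",    # two constants separated only by a space
  "c(1, 2)",   # comma-delimited
  "mean 2",    # function name and argument separated only by a space
  "mean (2)",  # the paren delimits; the space is ignored
  "x + y",     # infix operator delimits
  "x y",       # two names separated only by a space
  "`x y`"      # one backtick-quoted name that contains a space
)

checks <- vapply(snippets, parses, logical(1))
checks
```

Only the snippets in which something other than a space separates the tokens (or in which backticks mark the space as part of one name) come back TRUE.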

In general, though, not many things (any?) in R are solely space-separated. 1 2, var1 var2, and similar are parsing errors. If we have a variable that has a space (or is otherwise not compliant with https://cran.r-project.org/doc/FAQ/R-FAQ.html#What-are-valid-names_003f), then we must inform R how to include the spaces, and that is typically done with backticks.

`a b` <- 1
a b
# Error: unexpected symbol in "a b"
# x
`a b`
# [1] 1

In some places, we can use quotes, but backticks also work.

zz <- setNames(list(11, 12), c("a b", "c d"))
zz$`a b`
# [1] 11
zz$"c d"
# [1] 12
zz[["c d"]]
# [1] 12
zz[[`c d`]]
# Error: object 'c d' not found

Note that backticks are not always appropriate: in some locations, they make R look for an object with that name. Had we done zz[[`a b`]] here, it would not have erred, but only because the previous code block created a variable named `a b`; R would have found it (value 1), resolved the call to zz[[1]], and returned 11.
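To make the difference between quoting and backticking inside [[ ]] visible, here is a small sketch. It reuses zz from above, but assigns 2 (not 1) to the object `a b`, purely so that the two lookups return different elements; that value is my choice for illustration.

```r
zz <- setNames(list(11, 12), c("a b", "c d"))

# An object whose name contains a space. The value 2 is chosen only
# so that the two lookups below give different answers.
`a b` <- 2

# Quotes: a character string, so this looks up the element NAMED "a b".
by_name <- zz[["a b"]]

# Backticks: a symbol, so R first evaluates the object `a b` (which is 2)
# and then extracts by position, i.e. zz[[2]].
by_object <- zz[[`a b`]]

by_name
by_object
```

With these values, by_name is 11 and by_object is 12: same characters between the brackets, different meaning.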

Getting back to your case, your variable names have spaces in them. Many base R (and some package) data-reading functions have check.names= or a similarly purposed argument that converts a name such as a b into a.b, but readxl::read_excel does not do that, so it keeps the spaces. While I have mixed opinions on which is the better default, I think having spaces in variable names is a risk for new users. I do like that read_excel returns a tibble, and the presentation of tibbles tends to include (for visual reference if nothing else) backticks around not-legal names. For instance,

readxl::read_excel("Book2.xlsx")
# # A tibble: 1 x 3
#   `a b` `c d`    ef
#   <dbl> <dbl> <dbl>
# 1    11    22    33

which is a clear visual cue that the first two variable names need backtick enclosures.
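If you would rather not type backticks every time, one option (a sketch, not the only way) is to convert the names to syntactic ones after reading, which is roughly what check.names= does for you in base readers. The toy data frame below is a made-up stand-in for merged_data; the numbers are invented purely for illustration.

```r
# Toy stand-in for merged_data; all values are invented for illustration.
toy <- data.frame(
  all_high_stat_entre = c(0.61, 0.72, 0.55, 0.80, 0.67),
  `Uncertainty Avoidance Societal Practices` = c(4.1, 3.6, 4.8, 3.2, 4.0),
  check.names = FALSE  # keep the space-containing name, as read_excel would
)

# Option 1: keep the original name and backtick it in the formula.
fit1 <- lm(all_high_stat_entre ~ `Uncertainty Avoidance Societal Practices`,
           data = toy)

# Option 2: make the names syntactic first (spaces become dots),
# after which no backticks are needed.
names(toy) <- make.names(names(toy))
names(toy)

fit2 <- lm(all_high_stat_entre ~ Uncertainty.Avoidance.Societal.Practices,
           data = toy)
```

Both fits estimate the same model; only the spelling of the predictor differs. Whether to rename is a matter of taste: renaming saves typing, while keeping the original names keeps them identical to the spreadsheet headers.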

r2evans
  • Yeah, I don't disagree and often choose to not spend the time. Every now and then I get a "pedagogic" mindset. New users may not respond well to the (well-justified) closing of a question, occasionally a little prod in the right direction is very helpful. The risk of providing a full answer is that new users don't feel the incentive to improve their question-asking skills ... I often look back at a new user's most-recent questions to see if there is a trend and bias my response accordingly. Sometimes I am patient; Sometimes the caffeine has kicked in harder than intended ... – r2evans Oct 08 '21 at 16:55
  • @r2evans a big thanks, sir, for your understanding and your guidance in fixing the mistake in my code, and thanks for your patience as well. Please read the comment I made above to see my doubt about `dput(head())`. Lastly, thank you again for your long answer and your useful guide! – Nikolas Oct 08 '21 at 22:33
  • `dput(x)` is both perfect and a risk. It's perfect in that its structure is unambiguous in value, class, layout, names, etc. It's a risk because (don't do this) `dput(ggplot2::diamonds)` will flood the screen. The premise (for questions on SO) is that you reduce the size of your data and give us just enough to "play" with. For instance, it doesn't matter to us how you created `merged_data`, we just need to see several rows and several columns. Depending on the order of columns and the variability of rows, it might be sufficient to post `dput(x[1:10,c(1,9)])` into a code-block. Does that help? – r2evans Oct 08 '21 at 22:39
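
As a rough illustration of the `dput()` advice above, here is a sketch using the built-in `mtcars` in place of `merged_data` (which is not available here). The choice of three rows and two columns is arbitrary.

```r
# Cut the data down to a few rows and columns before calling dput().
small <- head(mtcars[, c("mpg", "cyl")], 3)

# This is the text you would paste into the question.
dput(small)

# Anyone can rebuild the object from that text alone:
txt <- paste(capture.output(dput(small)), collapse = "\n")
rebuilt <- eval(parse(text = txt))
all.equal(rebuilt, small)
```

The round trip at the end is only there to show why `dput()` output is useful to answerers: the pasted text reconstructs the same object, names and classes included.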