I am kind of new to R and have some issues generating a dummy variable by evaluating a number of conditions.

I am trying to create the dummy variable 'GRDUMMY'. GRDUMMY should take the value 1 if:

- SG_MA > SG_MA_Year_Avg & LIQ < LIQ_Year_avg

Otherwise, it should take value 0.

One complication is that both SG_MA and LIQ contain missing values (although SG_MA_Year_Avg and LIQ_Year_avg do not).

To generate the dummy variable and handle these issues, I have tried the following code:

for(i in 1:nrow(Merge_GRDUMMY)){
  if(is.na(Merge_GRDUMMY$SG_MA[i])){
    Merge_GRDUMMY$GRDUMMY <- "NA"
    }else if(is.na(Merge_GRDUMMY$LIQ[i])){
      Merge_GRDUMMY$GRDUMMY <- "NA"
    }else if(Merge_GRDUMMY$SG_MA[i] > Merge_GRDUMMY$SG_MA_Year_Avg[i] & Merge_GRDUMMY$LIQ[i] < Merge_GRDUMMY$LIQ_Year_avg[i]){
      Merge_GRDUMMY$GRDUMMY <- 1
    }else{
      Merge_GRDUMMY$GRDUMMY <- 0}
}

Sample data:

> dput(Merge_GRDUMMY[1:4, c(14, 16, 21, 22)])
structure(list(SG_MA = c(NA_real_, NA_real_, NA_real_, NA_real_
), LIQ = c(-0.166091210233936, -0.238975053258208, -0.0423391360788804, 
-0.0255328112422608), SG_MA_Year_Avg = c(NaN, NaN, NaN, NaN), 
    LIQ_Year_avg = c(-0.0460118085010656, -0.0460118085010656, 
    -0.0460118085010656, -0.0460118085010656)), row.names = c(NA, 
4L), class = "data.frame")

My problem is that the loop above seems to execute every branch and therefore assigns the value "0" to all observations, even those with missing values. Any tips on what I am doing wrong?

Many thanks!

Dario
  • Hello Dario, welcome to SO! Could you help us help you by providing a reproducible example? You could share your data, or part of it, using `dput(your_data)`. There is a great guide to sharing code [here](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). – cbo Mar 02 '20 at 09:47
  • Thanks for the tip cbo, I have now added some sample data! – Dario Mar 02 '20 at 10:05
  • Please check your dummy data. `SG_MA` and `SG_MA_Year_Avg` are **all** `NA`... – dario Mar 02 '20 at 10:18

2 Answers

It's usually faster and more readable to use vectorized functions in R. `ifelse` is the vectorized counterpart of `if`/`else`.

Since you did not post a minimal reproducible example I couldn't verify the answer, but this should help you out:

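# Rows with a missing SG_MA or LIQ get NA; otherwise 1 if both conditions hold, else 0.
# Note: the yearly LIQ average column is spelled `LIQ_Year_avg` in the posted data.
Merge_GRDUMMY$GRDUMMY <- ifelse(is.na(Merge_GRDUMMY$SG_MA) | is.na(Merge_GRDUMMY$LIQ), NA,
                                ifelse(Merge_GRDUMMY$SG_MA > Merge_GRDUMMY$SG_MA_Year_Avg & Merge_GRDUMMY$LIQ < Merge_GRDUMMY$LIQ_Year_avg, 1, 0))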
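
As for why the loop in the question gives every row the same value: each assignment has the form `Merge_GRDUMMY$GRDUMMY <- ...` with no `[i]`, so every iteration overwrites the whole column and only the result for the last row survives. It also stores the string `"NA"` rather than a real missing value. Below is a minimal sketch of a corrected loop, assuming a small made-up data frame `df` (the values are invented for illustration; only the column names come from the question):

```r
# Toy data: column names follow the question, values are made up.
df <- data.frame(
  SG_MA          = c(NA, 0.5, 0.2, 0.9),
  SG_MA_Year_Avg = c(0.3, 0.3, 0.3, 0.3),
  LIQ            = c(-0.2, NA, -0.1, -0.3),
  LIQ_Year_avg   = c(-0.05, -0.05, -0.05, -0.05)
)

# Pre-allocate the column so each iteration fills exactly one element.
df$GRDUMMY <- NA_real_

for (i in seq_len(nrow(df))) {
  if (is.na(df$SG_MA[i]) || is.na(df$LIQ[i])) {
    df$GRDUMMY[i] <- NA   # a real missing value, not the string "NA"
  } else if (df$SG_MA[i] > df$SG_MA_Year_Avg[i] &&
             df$LIQ[i] < df$LIQ_Year_avg[i]) {
    df$GRDUMMY[i] <- 1
  } else {
    df$GRDUMMY[i] <- 0
  }
}

df$GRDUMMY
```

The scalar operators `||` and `&&` are used here because each comparison inside the loop involves a single element. The vectorized `ifelse` version remains the shorter option.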
dario

Another way to do this, with dplyr:

suppressPackageStartupMessages( library(dplyr) )

set.seed(123)

dfr <- tibble(
        SG_MA = c(rnorm(10), NA),
        SG_MA_Year_Avg = rnorm(11),
        LIQ = c(NA, rnorm(10)),
        LIQ_Year_Avg = rnorm(11)
)
# dfr

dfr %>% mutate(indic = case_when(is.na(SG_MA) | is.na(LIQ) ~ NA_real_,
                                 SG_MA > SG_MA_Year_Avg & LIQ < LIQ_Year_Avg ~ 1,
                                 TRUE ~ 0
))
#> # A tibble: 11 x 5
#>      SG_MA SG_MA_Year_Avg    LIQ LIQ_Year_Avg indic
#>      <dbl>          <dbl>  <dbl>        <dbl> <dbl>
#>  1 -0.560           1.22  NA          -0.295     NA
#>  2 -0.230           0.360 -0.218       0.895      0
#>  3  1.56            0.401 -1.03        0.878      1
#>  4  0.0705          0.111 -0.729       0.822      0
#>  5  0.129          -0.556 -0.625       0.689      1
#>  6  1.72            1.79  -1.69        0.554      0
#>  7  0.461           0.498  0.838      -0.0619     0
#>  8 -1.27           -1.97   0.153      -0.306      0
#>  9 -0.687           0.701 -1.14       -0.380      0
#> 10 -0.446          -0.473  1.25       -0.695      0
#> 11 NA              -1.07   0.426      -0.208     NA
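
The explicit `is.na()` branch is not redundant. In R's three-valued logic, `NA & FALSE` evaluates to `FALSE`, so a row with a missing `SG_MA` but a `LIQ` at or above its average would otherwise be coded 0 instead of NA. A small base R sketch of this, using hypothetical scalar values that are not taken from the question's data:

```r
# Hypothetical single observation: SG_MA is missing, LIQ is above its yearly average.
sg_ma <- NA_real_
sg_ma_avg <- 0.3
liq <- 0.1
liq_avg <- -0.05

# NA > 0.3 is NA, 0.1 < -0.05 is FALSE, and NA & FALSE is FALSE.
cond <- sg_ma > sg_ma_avg & liq < liq_avg

# Without a guard, the missing SG_MA is silently coded as 0.
unguarded <- ifelse(cond, 1, 0)

# With the guard, the observation stays missing.
guarded <- ifelse(is.na(sg_ma) | is.na(liq), NA, ifelse(cond, 1, 0))

print(c(unguarded = unguarded, guarded = guarded))
```

As far as I know, `case_when` treats an `NA` condition as not matched and moves on to the next clause, so there the missing rows would fall through to `TRUE ~ 0` without the first clause.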
cbo