
Data set "dat" looks like this:

**V1  V2**  
1   2
2   2
3   5
9   8
9   9 
a   2

I want to create a dummy variable V3:

  1. 1 if V1 = V2
  2. 0 otherwise, as long as both values are within the range 1-8

Where a value above 8, or any symbol or letter, is involved, the variable should read NA. In the above example, the result should be

V3 = {0,1,0,NA,NA,NA}
Dinidu Hewage
  • use `dput` to create reproducible example http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example. What did you try so far? – Bulat Mar 12 '17 at 12:32
  • read about `?ifelse` – Bulat Mar 12 '17 at 12:33
  • I would do something like `library(data.table) ; setDT(df)[as.numeric(as.character(V1)) < 8 & as.numeric(as.character(V2)) < 8, V3 := +(V1 == V2)]` because `data.table` allows easy manipulations of subsets. Regardless, you seem to have bad data that needs to be fixed first. An R vector doesn't allow mixed types, hence `1` is not really `1` when you also have `a` in the same vector: it could be `"1"` (which is not the same!) or, in the case of a factor vector, you can get very unexpected results. I would suggest you first fix your data before you proceed to any kind of analysis. – David Arenburg Mar 12 '17 at 13:02

2 Answers


There are many ways to do this. This one loops over the rows with `apply` and returns a value for each row based on a set of rules, which makes it easy to extend to more complex rules. The warnings can be ignored; they are produced when "a" is coerced to numeric.

x <- read.table(text = "1   2
2   2
3   5
9   8
9   9 
a   2", header = FALSE)

x$V3 <- apply(x, MARGIN = 1, FUN = function(m) {
  xm <- as.numeric(as.character(m))

  if (!any(is.na(xm))) {
    if (any(xm > 8)) {
      return(NA)
    }
    if(xm[1] == xm[2]) {
      return(1)
    } else {
      return(0)
    }
  } else {
    return(NA)
  } 
})

  V1 V2 V3
1  1  2  0
2  2  2  1
3  3  5  0
4  9  8 NA
5  9  9 NA
6  a  2 NA
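
The same rules can probably be applied without a row-wise loop. The following is only a sketch of a vectorized variant using `ifelse`; the helper names `v1`, `v2` and `in_range` are made up for illustration, and unlike the loop above it also enforces the lower bound of 1.

```r
# Same example data as above; V1 is read as text because of the "a".
x <- read.table(text = "1 2
2 2
3 5
9 8
9 9
a 2", header = FALSE, stringsAsFactors = FALSE)

# Convert both columns to numeric; "a" becomes NA (warning suppressed on purpose).
v1 <- suppressWarnings(as.numeric(as.character(x$V1)))
v2 <- suppressWarnings(as.numeric(as.character(x$V2)))

# TRUE only when both values are present and inside 1..8 (hypothetical helper).
in_range <- !is.na(v1) & !is.na(v2) & v1 >= 1 & v1 <= 8 & v2 >= 1 & v2 <= 8

# 1 if equal, 0 if different, NA when a value is out of range or not a number.
x$V3 <- ifelse(in_range, as.integer(v1 == v2), NA_integer_)
```

For the six example rows this should give the same V3 column as the loop.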
Roman Luštrik

This would be one of the many ways it can be done. There might be some more efficient ways:

# Create the original dataset
data <- data.frame(V1 = c(1,2,3,9,9,"a"), V2 = c(2,2,5,8,9,2))
# Check if V1 == V2 and write the result (1 or 0) to V3 for ALL observations
data$V3 <- as.integer(data$V1 == data$V2)
# Where V1 or V2 are not in the range [1,8], overwrite V3 with NA
data$V3[!(grepl("\\b[12345678]\\b", data$V2) &
                grepl("\\b[12345678]\\b", data$V1))] <- NA

Where the "\\b[12345678]\\b" can be decomposed as follows:

1) the [12345678] part checks whether the string contains a digit from the range 1:8.

2) the \\b ... \\b part gives you the word boundaries - thus the number 2 will be matched, but the number 28 will not.
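
To see the word boundaries at work, here is a small sketch on a few made-up strings (the vectors `candidates` and `edge_cases` are illustrative, not part of the original data). It also shows one caveat: "." and "-" count as word boundaries, so "2.5" and "-3" still match. If that matters, an anchored pattern such as "^[1-8]$" may be a stricter alternative.

```r
pattern <- "\\b[12345678]\\b"

# Made-up test strings: only a standalone digit 1-8 should match.
candidates <- c("2", "28", "9", "a")
matches <- grepl(pattern, candidates)

# Caveat: "." and "-" are non-word characters, so these also match.
edge_cases <- c("2", "2.5", "-3")
loose  <- grepl(pattern, edge_cases)

# Anchoring the whole string is stricter (an alternative, not the pattern used above).
strict <- grepl("^[1-8]$", edge_cases)

print(data.frame(candidates, matches))
print(data.frame(edge_cases, loose, strict))
```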

If you wanted to match a range 0:13, you would adjust the regular expression like this:

data$V3[!(grepl("\\b([0-9]|1[0-3])\\b", data$V2) &
                grepl("\\b([0-9]|1[0-3])\\b", data$V1))] <- NA

Where the \\b([0-9]|1[0-3])\\b can be translated as follows:

1) [0-9] matches numbers 0:9

2) 1[0-3] matches numbers 10:13

3) [0-9]|1[0-3] tells you that numbers 0:9 or 10:13 should be matched

4) \\b...\\b gives you the word boundaries

5) (...) groups the alternation, so that both word boundaries apply to the whole expression within the brackets. Written without the brackets, the equivalent pattern would be: \\b[0-9]\\b|\\b1[0-3]\\b
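
A quick sketch of why the grouping matters, using made-up values (`values`, `grouped` and `ungrouped` are names chosen for illustration). If the brackets are simply dropped, the alternation splits the pattern into "\\b[0-9]" or "1[0-3]\\b", and any string starting with a digit matches.

```r
values <- c("0", "7", "13", "14", "95")

# Grouped: both word boundaries apply to the whole alternation.
grouped <- grepl("\\b([0-9]|1[0-3])\\b", values)

# Brackets dropped: reads as "\\b[0-9]" OR "1[0-3]\\b", which is too permissive.
ungrouped <- grepl("\\b[0-9]|1[0-3]\\b", values)

print(data.frame(values, grouped, ungrouped))
```

With the grouped pattern only "0", "7" and "13" match; with the brackets dropped all five strings match.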

For a more detailed introduction to matching numeric ranges with regular expressions, see this link: http://www.regular-expressions.info/numericranges.html

ira