In the following, logical operators don't seem to work properly.

a = c(TRUE, FALSE, TRUE, FALSE, TRUE, TRUE)
b = c('a', 'b', 'c', 'de', 'f', 'g')
c = c(1, 2, 3, 4, 5, 6)
d = c(0, 0, 0, 0, 0, 1)

wtf = data.frame(a, b, c, d)
wtf$huh = apply(wtf, 1, function(row) {
    if (row['a'] == T) { return('we win') }
    if (row['c'] < 5) { return('hooray') }
    if (row['d'] == 1) { return('a thing') }
    return('huh?')
})

Producing:

> wtf
      a  b c d     huh
1  TRUE  a 1 0  hooray
2 FALSE  b 2 0  hooray
3  TRUE  c 3 0  hooray
4 FALSE de 4 0  hooray
5  TRUE  f 5 0    huh?
6  TRUE  g 6 1 a thing

Naively, one would expect rows 1, 3, 5, and 6 to contain 'we win'.

Can someone explain to me (1) why it does this, (2) how can this be fixed such that it doesn't happen, (3) why all my logical columns are seemingly changed to characters, and (4) how can a function be type-safely applied to rows in a data frame?

ifly6
  • When asking for help, you should include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. Using `apply` with data.frames is not a good idea because it coerces to a matrix first, which can change all your data types. – MrFlick Apr 18 '18 at 20:21
  • Agree with @MrFlick the problem is almost certainly the use of `apply`. – joran Apr 18 '18 at 20:23
  • So how can this be fixed such that it doesn't happen? – ifly6 Apr 18 '18 at 20:26
  • There are some slick tools for operating on data frames by row in **purrr**, but frankly a simple for loop would be a fine place to start. – joran Apr 18 '18 at 20:30
  • To be clear, `apply` is working correctly, it's just that its correct behavior is confusing. `apply` literally coerces your data frame to a matrix. A matrix can only contain a single data type. Hence, typically everything will become characters (or whatever is the most general type it can coerce to). – joran Apr 18 '18 at 20:32

3 Answers

Why does this happen? Because apply is made for matrices. When you give it a data frame, the first thing that happens is that it gets converted to a matrix:

m = as.matrix(wtf)
m 
#      a       b    huh    huh1    
# [1,] " TRUE" "a"  "huh?" "hooray"
# [2,] "FALSE" "b"  "huh?" "huh?"  
# [3,] " TRUE" "c"  "huh?" "hooray"
# [4,] "FALSE" "de" "huh?" "huh?"  
# [5,] " TRUE" "f"  "huh?" "hooray"
# [6,] " TRUE" "g"  "huh?" "hooray"

When that happens, your different data types are lost and your data frame-style indexing doesn't work anymore:

m['a']
# [1] NA
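
As a minimal sketch of why the comparisons in the question misfire once everything is a string (the small data frame `df` is a hypothetical stand-in, not the data above):

```r
# Hypothetical stand-in with logical, character, and numeric columns
df <- data.frame(a = c(TRUE, FALSE), b = c("x", "y"), c = c(10, 5),
                 stringsAsFactors = FALSE)
m <- as.matrix(df)

typeof(m)        # "character": every cell is now a string

# A padded string such as " TRUE" (as in the matrix printed above)
# does not equal TRUE, because TRUE is coerced to the string "TRUE"
" TRUE" == TRUE  # FALSE

# Comparing a string with a number coerces the number to a string,
# so `<` compares text rather than magnitudes
"10" < 5         # TRUE, because "10" sorts before "5"
10 < 5           # FALSE
```

This is why the first condition in the question never fires and the second one behaves like a text comparison.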

Solution? Use a simple for loop:

wtf$huh1 = NA
for (i in seq_len(nrow(wtf))) {
    wtf$huh1[i] = if (wtf[i, 'a']) "hooray" else "huh?"
}

If you have a function foo then

wtf$huh2 = NA
for (i in seq_len(nrow(wtf))) {
    # write into the column initialized above (huh2, not huh1)
    wtf$huh2[i] = foo(wtf[i, 'a'])
}

Functions that aren't vectorized can be wrapped with Vectorize, so you don't have to write the loop yourself:

foov = Vectorize(foo)
# then you can
wtf$huh4 = foov(wtf$a)
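
For the full three-condition logic from the question, one possible sketch is to pass the needed columns to `mapply`, which walks them in parallel and keeps each value's own type. The helper name `label_row` and its argument names are made up for illustration, and column `b` is left out because the logic doesn't use it:

```r
# Data copied from the question (column b omitted)
wtf <- data.frame(
  a = c(TRUE, FALSE, TRUE, FALSE, TRUE, TRUE),
  c = c(1, 2, 3, 4, 5, 6),
  d = c(0, 0, 0, 0, 0, 1)
)

# Hypothetical helper: receives one row's values as separate, typed arguments
label_row <- function(a_val, c_val, d_val) {
  if (a_val) return("we win")
  if (c_val < 5) return("hooray")
  if (d_val == 1) return("a thing")
  "huh?"
}

# mapply() calls label_row once per row without building a matrix,
# so a_val stays logical and c_val, d_val stay numeric
wtf$huh <- mapply(label_row, wtf$a, wtf$c, wtf$d)
wtf$huh
# "we win" "hooray" "we win" "hooray" "we win" "we win"
```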
Gregor Thomas

Probably the easiest way to fix this is to use ifelse, which is vectorized, so you don't need to deal with loops or apply:

myfunc <- function(row) {
     ifelse (row['a'] == T,'hooray','huh?')
 }

wtf$huh <- myfunc(wtf)

      a  b      a
1  TRUE  a hooray
2 FALSE  b   huh?
3  TRUE  c hooray
4 FALSE de   huh?
5  TRUE  f hooray
6  TRUE  g hooray
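
To cover all three conditions from the question in the same vectorized style, a rough sketch is to nest the `ifelse` calls in the order of the original `if` statements. This works on whole columns rather than rows:

```r
# Data copied from the question (column b omitted)
wtf <- data.frame(
  a = c(TRUE, FALSE, TRUE, FALSE, TRUE, TRUE),
  c = c(1, 2, 3, 4, 5, 6),
  d = c(0, 0, 0, 0, 0, 1)
)

# Each ifelse() handles one condition; its "no" branch falls through
# to the next check, mirroring the order of the if statements
wtf$huh <- ifelse(wtf$a, "we win",
           ifelse(wtf$c < 5, "hooray",
           ifelse(wtf$d == 1, "a thing", "huh?")))
wtf$huh
# "we win" "hooray" "we win" "hooray" "we win" "we win"
```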
jeremycg

One advantage of a data.frame is that it can contain variables of different types.

    lapply(wtf, class)
    $a
    [1] "logical"

    $b
    [1] "factor"

    $huh
    [1] "character"

As noted by Gregor, apply requires a matrix and will convert the object you give it to one if possible. But matrices cannot contain multiple variable types and so as.matrix will look for a lowest common denominator that can represent the data, in this case, character.

    typeof(as.matrix(wtf))    
    [1] "character"

    class(as.matrix(wtf))    
    [1] "matrix"
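
As a small illustration of that coercion, here is a sketch with three hypothetical data frames (none of them from the question) showing how the resulting matrix type depends on the mix of columns:

```r
# Hypothetical examples: all-numeric, numeric plus character, logical plus numeric
num_only <- data.frame(x = c(1, 2), y = c(3.5, 4.5))
mixed    <- data.frame(x = c(1, 2), s = c("u", "v"), stringsAsFactors = FALSE)
lgl_num  <- data.frame(f = c(TRUE, FALSE), x = c(1, 2))

typeof(as.matrix(num_only))  # "double": nothing needs to change
typeof(as.matrix(mixed))     # "character": numbers become strings
typeof(as.matrix(lgl_num))   # "double": logicals widen to numbers
```

So `apply` is only type-safe when every column already shares one type; a single character or factor column turns the whole matrix into strings.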
mohanty