I'm trying to check my math after adding two columns to create a new column, following the approach in this article:

df$TotalAnimalMathCorrect <- sapply(df$TotalAnimals, identical, df$TotalFemales+df$TotalMales)

I am looking for any FALSE values that would indicate that my summation isn't working right.

I calculate female and male animals using this:

df$TotalMales <- apply(subset(df, select = c(Gender.1,Gender.2,Gender.3,Gender.4)), 1, function(x) length(which(x=="Male")))

#convert to a numeric variable
df$TotalMales <- as.numeric(df$TotalMales)

and

df$TotalFemales <- apply(subset(df, select = c(Gender.1,Gender.2,Gender.3,Gender.4)), 1, function(x) length(which(x=="Female")))

#convert to a numeric variable
df$TotalFemales <- as.numeric(df$TotalFemales)

When I look at the data, I can see that I am adding correctly, but since I have 170,000 rows, I'd like to double-check that TotalAnimals always equals the sum of the Female and Male animals.

But I am getting FALSE for every value in df$TotalAnimalMathCorrect, even in rows where I can see that 1+1 = 2 matches the value in df$TotalAnimals.

I've also confirmed that all three columns are numeric, and I applied as.numeric before adding the numbers, as you can see above and here:

> str(df$TotalMales)
 num [1:16929] 1 0 0 1 0 0 0 0 0 0 ...
> str(df$TotalFemales)
 num [1:16929] 0 1 1 0 1 0 2 1 1 0 ...
> str(df$TotalAnimals)
 num [1:16929] 1 1 1 1 1 1 2 1 1 1 ...

I also tried converting the variables with as.integer instead of as.numeric, to be more specific, but every row still has FALSE in the TotalAnimalMathCorrect column.

Any ideas as to why the identical call isn't giving TRUE when the numbers clearly match? I have read the documentation on identical.

Here's some sample data of what I expect:

> TotalMales      TotalFemales    TotalAnimals    TotalAnimalMathCorrect
> 1               1               2               TRUE

but, like I said, I'm getting this:

TotalMales      TotalFemales    TotalAnimals    TotalAnimalMathCorrect
1               1               2               FALSE

Here is reproducible code.

df <- data.frame(TotalMales=c(1,1,0), TotalFemales=c(1,0,0), TotalAnimals=c(2,1,0))

  TotalMales TotalFemales TotalAnimals
1          1            1            2
2          1            0            1
3          0            0            0

Thanks very much!

Jazzmine
  • I will provide a reproducible data set shortly. – Jazzmine Dec 21 '16 at 21:29
  • Can't follow along since you haven't shared a reproducible example. You should share a *small* **reproducible example**: [see here for best practices (use `dput` or share simulation code)](http://stackoverflow.com/q/5963269/903061). – Gregor Thomas Dec 21 '16 at 21:30
  • Maybe it's because I can't follow along, but why are you using `sapply`? Why not just `identical(df$TotalAnimals, df$TotalFemales + df$TotalMales)`? Or maybe `all(df$TotalAnimals == df$TotalFemales + df$TotalMales)`? Are you calculating these integers in a weird way that might lead to precision problems? – Gregor Thomas Dec 21 '16 at 21:34
  • I found an example that used sapply. I also saw in another question that identical is much faster than all. I tried your suggestion as well but got the same result, all FALSE. – Jazzmine Dec 21 '16 at 21:41
  • `all` may be slower than `identical`, but `all(z==z)` where `z` is a vector of length 170,000 takes approximately 0.002 seconds (!) on my laptop ... – Ben Bolker Dec 21 '16 at 21:57

1 Answer

Your problem is that

sapply(df$TotalAnimals, identical, df$TotalFemales+df$TotalMales)

does not match TotalAnimals with TotalFemales+TotalMales element-by-element; rather, it takes each element of TotalAnimals and compares it to the entire TotalFemales+TotalMales vector ... i.e., it does the equivalent of

 identical(df$TotalAnimals[1],df$TotalFemales+df$TotalMales)
 identical(df$TotalAnimals[2],df$TotalFemales+df$TotalMales)
 ...

Each of these comparisons gives FALSE because it is comparing a length-1 numeric vector to a length-N numeric vector (where N is the number of rows of df).
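To make this concrete, here is a minimal sketch in base R using the three-row sample data from the question. The variable names `total` and `per_element` are introduced only for illustration.

```r
# Sample data from the question
df <- data.frame(TotalMales   = c(1, 1, 0),
                 TotalFemales = c(1, 0, 0),
                 TotalAnimals = c(2, 1, 0))

total <- df$TotalFemales + df$TotalMales

# sapply() passes identical() one element of TotalAnimals at a time,
# but always the whole `total` vector as the second argument
per_element <- sapply(df$TotalAnimals, identical, total)
print(per_element)   # FALSE FALSE FALSE

# A length-1 vector is never identical to a length-3 vector
identical(df$TotalAnimals[1], total)      # FALSE

# Comparing the matching elements does give TRUE
identical(df$TotalAnimals[1], total[1])   # TRUE
```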

with(df,identical(TotalAnimals, TotalFemales+TotalMales))

should work fine. Another alternative, if you don't need to worry about NA values, is

with(df,TotalAnimals==TotalFemales+TotalMales)

Doing it this way (vectorized, element by element) will help if you want to check which elements differ.
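As a sketch of how the element-by-element comparison can locate mismatches, the example below adds a fourth, deliberately wrong row to the question's sample data. That row and the name `bad_rows` are made up for illustration.

```r
# Sample data from the question, plus one deliberately wrong row
# (the fourth row is hypothetical, added only to show a mismatch)
df <- data.frame(TotalMales   = c(1, 1, 0, 2),
                 TotalFemales = c(1, 0, 0, 1),
                 TotalAnimals = c(2, 1, 0, 4))

# Element-by-element comparison: one TRUE/FALSE per row
df$TotalAnimalMathCorrect <- with(df, TotalAnimals == TotalFemales + TotalMales)

# Row numbers where the sum does not match
bad_rows <- which(!df$TotalAnimalMathCorrect)
print(bad_rows)        # 4
print(df[bad_rows, ])  # inspect the offending rows
```

Note that `which()` silently skips rows where the comparison is NA, so this only flags mismatches when the columns have no missing values.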

(I would typically include the line

stopifnot(identical(df$TotalAnimals,df$TotalFemales+df$TotalMales))

in my code to stop with an error if there's a problem.)
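One caveat worth keeping in mind, given that the question mentions trying as.integer: `identical` also compares storage type, so an integer vector and a double vector with the same values are not identical. The sketch below uses made-up vectors (`animals_int`, `sum_dbl`) to show this, and assumes only one side was converted to integer.

```r
animals_int <- c(2L, 1L, 0L)               # integer, e.g. after as.integer()
sum_dbl     <- c(1, 1, 0) + c(1, 0, 0)     # double

# identical() is strict about type: integer vs. double gives FALSE
same_object <- identical(animals_int, sum_dbl)

# == compares values, so the types are coerced before comparing
same_values <- all(animals_int == sum_dbl)

# all.equal() compares numeric values up to a small tolerance
near_equal <- isTRUE(all.equal(animals_int, sum_dbl))

print(c(same_object, same_values, near_equal))   # FALSE TRUE TRUE
```

If the check is meant to compare values rather than exact objects, `stopifnot(all(...))` or `stopifnot(isTRUE(all.equal(...)))` may be the safer form.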

Ben Bolker
  • I see, I thought it was a rowwise type function. I see in the documentation it says "The safe and reliable way to test two objects for being exactly equal. It returns TRUE in this case, FALSE in every other case.". I guess object is the entire dataframe, not a row. And yes, this worked exactly as I was seeking. Thanks for your help Ben. – Jazzmine Dec 21 '16 at 21:52