I'm going crazy because the sum() function is giving me different results that make no sense. I have 4 numerical variables: A, B, M, N. I also have a weights variable: W.

If I make the weighted sum:

sum(df$W * (df$A), na.rm = T) = AR
sum(df$W * df$A, na.rm = T) = AR

The result is the same.

If I add B:

sum(df$W * (df$A + df$B), na.rm = T) = ABR
sum(df$W * df$A, df$W * df$B, na.rm = T) = ABR

The result is the same.

If I add M:

sum(df$W * (df$A + df$B + df$M), na.rm = T) = ABMR1
sum(df$W * df$A, df$W * df$B, df$W * df$M, na.rm = T) = ABMR2

The results become different.

If I add N:

sum(df$W * (df$A + df$B + df$M + df$N), na.rm = T) = ABMNR1
sum(df$W * df$A, df$W * df$B, df$W * df$M, df$W * df$N, na.rm = T) = ABMNR2

The result is different.

So it seems that the M and/or N variables have some problem. BUT, if I start by adding the M and N variables instead:

sum(df$W * (df$M), na.rm = T) = MR
sum(df$W * df$M, na.rm = T) = MR

The result is the same.

If I add N:

sum(df$W * (df$M + df$N), na.rm = T) = MNR
sum(df$W * df$M, df$W * df$N, na.rm = T) = MNR

The result is the same.

Now, if I add A:

sum(df$W * (df$M + df$N + df$A), na.rm = T) = MNA1
sum(df$W * df$M, df$W * df$N, df$W * df$A, na.rm = T) = MNA2

The results become different.

If I add B:

sum(df$W * (df$M + df$N + df$A + df$B), na.rm = T) = MNAB1
sum(df$W * df$M, df$W * df$N, df$W * df$A, df$W * df$B, na.rm = T) = MNAB2

The result is different.

Now it seems the problem comes from the A or B variables. How is this possible? Is there any difference between multiplying the W variable by the sum of the variables (first way) and summing the separately weighted variables (second way)?

Thank you very much for any help you can provide!

RHertel
G.Castells
  • I suspect that this is due to `NA`s in `df$A` and/or in `df$B`, in combination with `na.rm=TRUE`. – RHertel Feb 25 '20 at 08:45
  • Hi G.Castells. Welcome to StackOverflow! Please read the info about [how to ask a good question](https://stackoverflow.com/help/how-to-ask) and how to give a [minimal reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610#5963610). That way you can help others to help you! – dario Feb 25 '20 at 08:53
  • It would be great if you provided the data you used, so that the problem is reproducible for other users. Use `dput(data)` function for this and include it in your post. – MKR Feb 25 '20 at 08:54
  • How much "different" are the results? Please, provide a reproducible example. – nicola Feb 25 '20 at 08:56

2 Answers

Consider this minimal example:

df <- data.frame(W = c(1, 2), A = c(NA, 3), B = c(4, NA))

Let's check:

sum(df$W * df$A, na.rm = TRUE)
#[1] 6
sum(df$W * df$B, na.rm = TRUE)
#[1] 4
sum(df$W * df$B, df$W * df$A, na.rm = TRUE)
#[1] 10
sum(df$W * (df$B + df$A), na.rm = TRUE)
#[1] 0

You should figure out what's going on. Hint:

df$W*(df$B+df$A)
#[1] NA NA
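
As a sketch of how one might dig into the hint, the snippet below rebuilds the same `df` and compares the two orders of operation element by element. The names `added_first`, `term_A`, `term_B` and `rows_lost` are illustrative and not part of the question.

```r
# Same toy data as in the minimal example above
df <- data.frame(W = c(1, 2), A = c(NA, 3), B = c(4, NA))

# Adding first: a row becomes NA as soon as either A or B is NA
added_first <- df$W * (df$B + df$A)
is.na(added_first)   # every row is NA here, so na.rm = TRUE leaves nothing to sum

# Multiplying first: each term only loses the rows where its own column is NA
term_A <- df$W * df$A
term_B <- df$W * df$B

# Number of rows that the "add first" form discards entirely
rows_lost <- sum(!complete.cases(df[, c("A", "B")]))
rows_lost
```

On real data, a non-zero `rows_lost` would suggest that the two ways of writing the weighted sum cannot be expected to agree.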
nicola

This is due to the NAs. Here's an example illustrating the situation:

x <- c(1,2,NA)
y <- c(1,NA,3)
z <- c(2,3,4)
s1 <- sum(x * (y + z), na.rm = TRUE)
s2 <- sum(x * y, x * z, na.rm = TRUE)

This yields s1 = 3 and s2 = 9. The sums, however, are the same if there is no NA. Let's have a look at what happens:

  1. For s1, the sum (y+z) yields the vector 3 NA 7. Multiplying it by the vector x gives 3 NA NA. Excluding NAs, the sum is 3.
  2. For s2, the product x*y yields 1 NA NA and the product x*z gives 2 6 NA. Excluding NAs, the sum of these vectors is 9.

In short, the distributive property known from ordinary algebra does not hold when NAs are present and are dropped with na.rm = TRUE.
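
If the intent is for a missing value to contribute nothing to its row rather than wipe out the whole row, one possible workaround is to add the variables with `rowSums(..., na.rm = TRUE)` before weighting. This is only a sketch under that assumption: whether treating an NA as 0 is appropriate depends on what the missing values mean in the data. The names `yz` and `s1_fixed` are illustrative.

```r
x <- c(1, 2, NA)
y <- c(1, NA, 3)
z <- c(2, 3, 4)

# Row-wise sum of y and z that skips NAs instead of propagating them
yz <- rowSums(cbind(y, z), na.rm = TRUE)

# Weighted sum, adding first but with NAs in y or z ignored per row
s1_fixed <- sum(x * yz, na.rm = TRUE)

# Weighted sum, multiplying first (unchanged from the example above)
s2 <- sum(x * y, x * z, na.rm = TRUE)

s1_fixed == s2   # the two forms now agree on this toy data
```

Rows where the weight `x` itself is NA are still dropped by both forms, which is why they agree.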

RHertel
  • Yes, that's it. I replaced the NAs with 0s and now all the sums give the same result. Thank you very much! – G.Castells Feb 25 '20 at 10:06