R - Differences using sum with each variable separately or sum with all variables together

Question

I'm going mad because the sum() function is giving me different results that make no sense. I have 4 numerical variables: A, B, M, N. I also have a weights variable: W.

If I make the weighted sum:

sum(df$W * (df$A), na.rm = T) = AR
sum(df$W * df$A, na.rm = T) = AR

The result is the same.

If I add B:

sum(df$W * (df$A + df$B), na.rm = T) = ABR
sum(df$W * df$A, df$W * df$B, na.rm = T) = ABR

The result is the same.

If I add M:

sum(df$W * (df$A + df$B + df$M), na.rm = T) = ABMR1
sum(df$W * df$A, df$W * df$B, df$W * df$M, na.rm = T) = ABMR2

The results become different.

If I add N:

sum(df$W * (df$A + df$B + df$M + df$N), na.rm = T) = ABMNR1
sum(df$W * df$A, df$W * df$B, df$W * df$M, df$W * df$N, na.rm = T) = ABMNR2

The result is different.

So it seems the M and/or N variables have some problem. BUT, if I start with the M and N variables instead...:

sum(df$W * (df$M), na.rm = T) = MR
sum(df$W * df$M, na.rm = T) = MR

The result is the same.

If I add N:

sum(df$W * (df$M + df$N), na.rm = T) = MNR
sum(df$W * df$M, df$W * df$N, na.rm = T) = MNR

The result is the same.

Now, if I add A:

sum(df$W * (df$M + df$N + df$A), na.rm = T) = MNA1
sum(df$W * df$M, df$W * df$N, df$W * df$A, na.rm = T) = MNA2

The results become different.

If I add B:

sum(df$W * (df$M + df$N + df$A + df$B), na.rm = T) = MNAB1
sum(df$W * df$M, df$W * df$N, df$W * df$A, df$W * df$B, na.rm = T) = MNAB2

The result is different.

Now it seems the problem comes from the A or B variables. How is that possible? Is there any difference between multiplying the W variable by the sum of the variables (first way) and passing each weighted variable to sum() separately (second way)?

Thank you very much for any help you can provide!

I suspect that this is due to `NA`'s in `df$A` and/or in `df$B`, in combination with `na.rm=TRUE`. — RHertel, Feb 25 '20 at 08:45
Hi H.Castells. Welcome to StackOverflow! Please read the info about [how to ask a good question](https://stackoverflow.com/help/how-to-ask) and how to give a [minimal reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610#5963610). That way you can help others to help you! — dario, Feb 25 '20 at 08:53
It would be great if you provided the data you used, so that the problem is reproducible for other users. Use `dput(data)` function for this and include it in your post. — MKR, Feb 25 '20 at 08:54
How much "different" are the results? Please, provide a reproducible example. — nicola, Feb 25 '20 at 08:56

score 0 · Answer 1 · answered Feb 25 '20 at 09:06

Consider this minimal example:

df<-data.frame(W=c(1,2),A=c(NA,3),B=c(4,NA))

Let's check:

sum(df$W*df$A,na.rm=TRUE)
#[1] 6
sum(df$W*df$B,na.rm=TRUE)
#[1] 4
sum(df$W*df$B,df$W*df$A,na.rm=TRUE)
#[1] 10
sum(df$W*(df$B+df$A),na.rm=TRUE)
#[1] 0

You should be able to figure out what's going on. Hint:

df$W*(df$B+df$A)
#[1] NA NA
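
To make the hint concrete, here is a small sketch that reuses the same `df`. The names `combined` and `separate` are only illustrative. It shows that adding the columns first turns every row with a missing value into `NA`, whereas passing the products separately lets `na.rm` drop only the individual missing terms:

```r
df <- data.frame(W = c(1, 2), A = c(NA, 3), B = c(4, NA))

# Adding first: a row becomes NA as soon as A or B is missing in that row
combined <- df$W * (df$B + df$A)
is.na(combined)
#[1] TRUE TRUE

# Separate products: only the individual missing terms are NA
separate <- c(df$W * df$B, df$W * df$A)
separate
#[1]  4 NA NA  6

sum(combined, na.rm = TRUE)
#[1] 0
sum(separate, na.rm = TRUE)
#[1] 10
```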

score 0 · Accepted Answer · answered Feb 25 '20 at 09:10

This is due to the NAs. Here's an example illustrating the situation:

x <- c(1,2,NA)
y <- c(1,NA,3)
z <- c(2,3,4)
s1 <- sum(x*(y+z), na.rm = T)
s2 <- sum(x*y,x*z, na.rm = T)

This yields s1 = 3 and s2 = 9. The sums, however, are the same if there is no NA. Let's have a look at what happens:

For s1, the sum (y+z) yields the vector 3 NA 7. Multiplied by the vector x, this gives 3 NA NA. Excluding the NAs, the sum is 3.
For s2, the product x*y yields 1 NA NA and the product x*z gives 2 6 NA. Excluding the NAs, the sum over both vectors is 9.

In short, the distributive property known from usual algebra does not hold if NAs are present.
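
If the intent is to treat a missing value as contributing nothing (an assumption about the analysis, not something that is right in general), the two forms can be made to agree. The sketch below shows two possible ways with the same x, y and z: summing row-wise with `rowSums(..., na.rm = TRUE)`, or replacing NAs by 0 beforehand. The helper `na0` is a hypothetical name introduced here for illustration.

```r
x <- c(1, 2, NA)
y <- c(1, NA, 3)
z <- c(2, 3, 4)

# Reference: separate products, NAs dropped term by term
s2 <- sum(x * y, x * z, na.rm = TRUE)

# Option 1: add y and z row-wise, ignoring NAs within each row
yz <- rowSums(cbind(y, z), na.rm = TRUE)
s1_rowsums <- sum(x * yz, na.rm = TRUE)

# Option 2: replace NA by 0 before doing any arithmetic
na0 <- function(v) ifelse(is.na(v), 0, v)  # hypothetical helper
s1_zero <- sum(na0(x) * (na0(y) + na0(z)))

c(s2, s1_rowsums, s1_zero)
#[1] 9 9 9
```

Whether replacing NA by 0 is appropriate depends on what a missing value means in the data, so this is worth checking before applying it to a weighted sum.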

Yes, that's it. I replaced the NAs with 0s and now all the sums give the same result. Thank you very much! — G.Castells, Feb 25 '20 at 10:06
