
I have not added a data sample for now, as it does not seem relevant to the problem.

I want to perform:

sum(df$y[df$x4 %in% c("1.A", "1.B", "1.C", "1.D") & x == "1990" & x2 == "Austria" & x1 != "All greenhouse gases - (CO2 equivalent)"]) ==
sum(df$y[df$x4 %in% c("1") & df$x == "1990" & df$x2 == "Austria" & x1 != "All greenhouse gases - (CO2 equivalent)"])

Expected outcome: "TRUE"

When performing

sum(df$y[df$x4 %in% c("1.A", "1.B", "1.C", "1.D") & x == "1990" & x2 == "Austria" & x1 != "All greenhouse gases - (CO2 equivalent)"])

I get [1] 51347.52

When performing the second operation I get the same.

sum(df$y[df$x4 %in% c("1") & df$x == "1990" & df$x2 == "Austria" & x1 != "All greenhouse gases - (CO2 equivalent)"])

So far so good.

However, when I run the comparison stated at the beginning of this post, I get "FALSE", even though running both operations separately prints the same value. How can this be?

The data source has five decimal places, but can this really be due to this?

Thanks in any case....

Nordsee
  • *"The data source has five decimal places, but can this really be due to this?"* yes, almost certainly. – Gregor Thomas Oct 04 '18 at 13:10
  • Instead of testing equality, perform subtraction and look at the difference. If the difference is on the order of `1e-16` or something, then they're basically equal. If the difference is bigger, then it seems your assumption of equality is incorrect – Gregor Thomas Oct 04 '18 at 13:12
  • At a glance, the other possible place I see for errors is that you are inconsistent in referring to columns as `df$x` or just `x`. On your left hand side you refer to `df$x4`, but also `x`, `x1`, and `x2` without the `df$`. On the right hand side you use `df$x4`, `df$x`, `df$x2` and `x1`. This makes it seem like maybe you've `attach`ed your data frame, which means that the values in, e.g., `df$x2` and `x2` may or may not be in sync still, depending on what operations you've run since `attach`ing. Common advice is to never use `attach`. – Gregor Thomas Oct 04 '18 at 13:15
  • Wow, fascinating read! Thanks, @Gregor! – Roman Oct 04 '18 at 13:20
  • @Gregor Thank you for your help. I have edited the code, but the problem persists. However, you are right, the error is due to the decimal places. I am getting [1] -9.999996e-06 when subtracting. Is there any way to tell the code to ignore everything after the 2nd decimal place? – Nordsee Oct 04 '18 at 13:23
  • I'd suggest reading the answers at the duplicate that I marked, there are lots of practical suggestions there. You can use `round()` with whatever precision you want, you can subtract and test that `abs(x - y) < my_threshold`... – Gregor Thomas Oct 04 '18 at 13:25
  • @Gregor Thanks again and sorry for the duplicate. My search terms didn't lead me to relevant existing topics – Nordsee Oct 04 '18 at 13:27
  • @Gregor: I have tried this: `round(sum(df$y[df$x4 %in% c("1.A", "1.B", "1.C", "1.D") & df$x == "1990" & df$x2 == "Austria" & df$x1!="All greenhouse gases - (CO2 equivalent)"], digits=2))`. Instead of rounding the result to two decimal places, R drops the decimal places and increases the actual output by 2 (i.e. 51348.22 becomes 51350). How can this be? – Nordsee Oct 04 '18 at 13:40
  • No worries about the dupe, that's why it's easy to mark as a dupe. And now your question will stand as another pointer to that dupe. Floating point precision problems are something everyone encounters sooner or later, and it seems very counterintuitive at first. – Gregor Thomas Oct 04 '18 at 13:41
  • @Nordsee check your syntax. Make sure you are giving `digits = 2` to `round()` not to `sum()`. Count your parens. Glancing at your code above which is quite hard to read, you'd do well to format your code better. Use line breaks and indentation when you have long lines and nesting. RStudio does a pretty good job auto-formatting if you use the Code/Reformat Code menu selection. – Gregor Thomas Oct 04 '18 at 13:42
  • @Gregor This must be my afternoon low. Thanks again for your help. All is working now! – Nordsee Oct 04 '18 at 13:47
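
The suggestions from the comments above can be sketched as follows. This is only a minimal illustration, not the asker's data: the values `a` and `b` and the threshold `my_threshold` are hypothetical, picked so that the two totals differ by roughly 1e-05, like the difference reported above. The last two lines show why the placement of `digits = 2` matters.

```r
# Hypothetical totals that print identically at R's default 7 significant digits
a <- 51347.52000
b <- 51347.52001

a == b                      # exact comparison: FALSE
a - b                       # inspect the difference: about -1e-05

# Option 1: compare against an explicit tolerance (the threshold is an assumption)
my_threshold <- 1e-4
abs(a - b) < my_threshold   # TRUE

# Option 2: round both sides to the precision you care about before comparing
round(a, digits = 2) == round(b, digits = 2)

# Parenthesis placement matters: here digits = 2 is passed to sum(),
# which adds it to the total (3.579 + 2) before round() drops the decimals
round(sum(c(1.234, 2.345), digits = 2))   # 6

# Here digits = 2 is passed to round(), as intended
round(sum(c(1.234, 2.345)), digits = 2)   # 3.58
```

Which option fits depends on how much precision the data actually carries. Rounding can still disagree for two values that sit on either side of a rounding boundary, so the tolerance test is usually the safer of the two.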

1 Answer


Edit: this should work.

Building data frame

df <- data.frame(x = 1990,
                 x2 = "Austria",
                 x4 = c("1.A", "1.B", "1.C", "1.D", "1.E",
                        "1.F", "1.G", "1.H", "1.I", "1.J"),
                 # `:` steps by 1, so this yields the single value 1e-12,
                 # which is recycled across all ten rows
                 y = c(0.000000000001:0.000000000010),
                 x1 = "Yearly gases")

df
      x      x2  x4     y           x1
1  1990 Austria 1.A 1e-12 Yearly gases
2  1990 Austria 1.B 1e-12 Yearly gases
3  1990 Austria 1.C 1e-12 Yearly gases
4  1990 Austria 1.D 1e-12 Yearly gases
5  1990 Austria 1.E 1e-12 Yearly gases
6  1990 Austria 1.F 1e-12 Yearly gases
7  1990 Austria 1.G 1e-12 Yearly gases
8  1990 Austria 1.H 1e-12 Yearly gases
9  1990 Austria 1.I 1e-12 Yearly gases
10 1990 Austria 1.J 1e-12 Yearly gases

Result of sums

# Formula 1
sum(df$y[df$x4 %in% c("1.A", "1.B", "1.C", "1.D") &
             df$x == "1990" &
             df$x2 == "Austria" &
             df$x1 != "All greenhouse gases - (CO2 equivalent)"]) 
[1] 4e-12


# Formula 2
sum(df$y[df$x4 %in% c("1") &
             df$x == "1990" &
             df$x2 == "Austria" &
             df$x1 != "All greenhouse gases - (CO2 equivalent)"])
[1] 0

Result with ==

sum(df$y[df$x4 %in% c("1.A", "1.B", "1.C", "1.D") &
             df$x == "1990" &
             df$x2 == "Austria" &
             df$x1 != "All greenhouse gases - (CO2 equivalent)"]) ==
sum(df$y[df$x4 %in% c("1") &
             df$x == "1990" &
             df$x2 == "Austria" &
             df$x1 != "All greenhouse gases - (CO2 equivalent)"])
[1] FALSE

Result with all.equal

isTRUE(all.equal(sum(df$y[df$x4 %in% c("1.A", "1.B", "1.C", "1.D") &
                              df$x == "1990" &
                              df$x2 == "Austria" &
                              df$x1 != "All greenhouse gases - (CO2 equivalent)"]),
                 sum(df$y[df$x4 %in% c("1") &
                              df$x == "1990" &
                              df$x2 == "Austria" & 
                              df$x1 != "All greenhouse gases - (CO2 equivalent)"]))
       )
[1] TRUE
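
A note on why `all.equal` returns TRUE here: it compares within a tolerance (by default roughly 1.5e-8) instead of testing exact equality. The sketch below uses hypothetical values `a` and `b` on the scale of the question's totals, differing by about 1e-05 as reported in the comments above. The stricter `tolerance` value is an arbitrary choice for illustration.

```r
# Hypothetical totals that differ by about 1e-05
a <- 51347.52000
b <- 51347.52001

a == b                                      # FALSE: exact comparison
isTRUE(all.equal(a, b))                     # TRUE: within the default tolerance
isTRUE(all.equal(a, b, tolerance = 1e-12))  # FALSE: stricter tolerance

# For values far below the tolerance the comparison is effectively absolute,
# which is why the tiny toy sums above compare equal to 0
isTRUE(all.equal(4e-12, 0))                 # TRUE
```

Wrapping the call in `isTRUE()` matters because `all.equal` returns a character description of the difference, not `FALSE`, when the values are not considered equal.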
Roman