First of all I have checked the existing topics. Unfortunately, they are either not exactly relevant or I am not able to understand them. As you will know from my type of question, I'm VERY new to R. I hope this is okay...

I feel I am on the right track...

Here is an excerpt of the data frame (df): https://i.stack.imgur.com/5jv0m.jpg

I want to check whether the values of the emission subcategories (y) sum up to the values stated in the parent categories. Part of this is summing up the values of the subcategories.

In short, I want to know whether sum(3.B.1 + 3.B.2 + ... + 3.B.n) = 3.B (i.e. the sum stated in the CSV) for a given year and country. I want to verify the sums.

I've tried this code (with 2010 and Austria):

sum(compare_df, x4 %in% c("1.A.1", "1.A.2", "1.A.3", "1.A.4", "1.A.5") & x 
== "2010" & x2 == "Austria")

but get this:

Error in FUN(X[[i]], ...) : only defined on a data frame with all numeric variables

Once this works, is there a way to automate running the code for other conditions (i.e. a list of countries and years)? Even some keywords would be helpful here; I could then search for them myself.

I hope my question is clear enough and thank you for any sort of help or suggestion. Sorry for such a long post...

PS: I've updated everything now and hope my question is clearer.

Nordsee
  • It would be helpful if you provided some sample data using `dput`. The same goes for the expected output. – Wimpel Oct 01 '18 at 17:59
  • In your `sum` function you have `x=` and `x2=`, but you should be using `==`, not `=`, when subsetting on conditions. I need data to help further. – Mike Oct 01 '18 at 19:21
  • Welcome to SO. Please, provide a [mcve]. Thank you. – Uwe Oct 02 '18 at 08:30
  • What are you aiming at? Just summarising for a particular subset of your dataset, or aggregating for all groups? – Uwe Oct 02 '18 at 08:32
  • @Wimpel I have added some information and clarification. Thanks for looking at my problem! – Nordsee Oct 02 '18 at 11:34
  • @Uwe I hope this helps – Nordsee Oct 02 '18 at 11:34
  • @Mike I hope this helps – Nordsee Oct 02 '18 at 11:34
  • Please do not post pictures of your data. Post the data instead; see https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example#5963610 and https://www.r-bloggers.com/three-tips-for-posting-good-questions-to-r-help-and-stack-overflow/ – Wimpel Oct 02 '18 at 11:53
  • @Wimpel thank you for this, but I'm also facing problems here. As my data has many levels, I have tried Data <- read.table(text=df, header = TRUE), but this gets rejected "Error in textConnection(text, encoding = "UTF-8") : invalid 'text' argument" – Nordsee Oct 02 '18 at 12:18

2 Answers

It is hard to be sure without knowing what compare_df looks like, but here is a possible solution using dplyr, which is great for working with data frames.

The %>% operator is the 'pipe' which takes the results of the previous function and inserts them into the first argument of the subsequent function.

All of the dplyr functions (filter, group_by, summarize, etc.) take the data as their first argument, so they work nicely with %>%.

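library(dplyr)

compare_df %>% 
     # keep only the subcategory rows
     filter(x4 %in% c("1.A.1", "1.A.2", "1.A.3", "1.A.4", "1.A.5")) %>% 
     # one group per year (x) and country (x2)
     group_by(x, x2) %>% 
     # sum the emissions within each group, ignoring missing values
     summarize(sum_emmissions = sum(y, na.rm = TRUE)) %>% 
     # show the result for a single year and country
     filter(x == "2010", x2 == "Austria")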
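
To check the grouped sums against the stated parent totals for every year and country at once, the pipeline above could be extended roughly as follows. This is only a sketch: the toy data, the parent code "1.A" and the names sub_sums, stated, sub_total and matches are assumptions made for illustration, and the column meanings (x = year, x2 = country, x4 = category code, y = emissions) are inferred from the question.

```r
library(dplyr)

# Hypothetical toy data; only the column names come from the question.
compare_df <- data.frame(
  x  = c(2010, 2010, 2010, 2010, 2011, 2011, 2011),
  x2 = "Austria",
  x4 = c("1.A", "1.A.1", "1.A.2", "1.A.3", "1.A", "1.A.1", "1.A.2"),
  y  = c(60, 10, 20, 30, 50, 20, 25),
  stringsAsFactors = FALSE
)

# Sum the subcategories for every year/country combination.
sub_sums <- compare_df %>%
  filter(startsWith(x4, "1.A.")) %>%
  group_by(x, x2) %>%
  summarize(sub_total = sum(y, na.rm = TRUE), .groups = "drop")

# Take the stated parent totals and compare them with the computed sums.
# A small tolerance avoids false mismatches from floating-point rounding.
check <- compare_df %>%
  filter(x4 == "1.A") %>%
  select(x, x2, stated = y) %>%
  left_join(sub_sums, by = c("x", "x2")) %>%
  mutate(matches = abs(stated - sub_total) < 1e-8)

print(check)
```

Rows where matches is FALSE are the year/country combinations whose subcategories do not add up to the stated parent value. Because the grouping covers all years and countries, no loop over conditions is needed.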
TBT8
  • Thank you, TBT8, I have updated/ clarified the information in my initial post. Could you have a look, please? – Nordsee Oct 02 '18 at 11:37

If you want to verify the sums of the y variable, you need to specify which variable you want to sum. Currently your sum statement is trying to sum the whole data.frame, and when it encounters a categorical variable it throws the error:

Error in FUN(X[[i]], ...) : only defined on a data frame with all numeric variables

I didn't reproduce your code, but this can be verified with sum(iris). If you truly want to sum all numeric variables, you would have to do this: sum(iris[sapply(iris, is.numeric)]).
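
Both behaviours can be tried on the built-in iris data. The snippet below is a small sketch; the names result, numeric_cols and total are only illustrative, and the exact wording of the error message may differ between R versions.

```r
# sum() on the whole data frame fails because Species is a factor.
# tryCatch() captures the error message instead of stopping the script.
result <- tryCatch(sum(iris), error = function(e) conditionMessage(e))
print(result)

# Restrict the data frame to its numeric columns first, then sum.
numeric_cols <- sapply(iris, is.numeric)
total <- sum(iris[numeric_cols])
print(total)
```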

But to get to your question about subsetting on three variables you would have to do something like this:

sum(iris$Sepal.Length[iris$Species %in% c("setosa","versicolor") &
                        iris$Sepal.Width >= 3 &
                        iris$Petal.Length >= 2])

First you have to tell sum which data.frame and variable you want to sum over (the iris$Sepal.Length part of the code; this would be your df$y). Then, with [, you subset on the variables of interest. In your code, when you refer to variables without the df$ notation, R will not find them because they are not objects of their own but columns of the data.frame. Hope this helps.

Also, in your post your year variable is numeric and not categorical, so you should remove the quotes around 2010.
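
Applied to the data frame from the question, the same pattern might look like the sketch below. The toy values and the parent code "1.A" are made up for illustration; only the column names (x = year, x2 = country, x4 = category code, y = emissions) are taken from the question, and their meaning is an assumption.

```r
# Hypothetical toy data; replace with the real df.
df <- data.frame(
  x  = c(2010, 2010, 2010, 2010, 2011),
  x2 = c("Austria", "Austria", "Austria", "Germany", "Austria"),
  x4 = c("1.A", "1.A.1", "1.A.2", "1.A.1", "1.A.1"),
  y  = c(30, 10, 20, 99, 5),
  stringsAsFactors = FALSE
)

# Logical vectors: which rows are subcategories, and which rows
# belong to the year and country being checked.
is_sub   <- df$x4 %in% c("1.A.1", "1.A.2", "1.A.3", "1.A.4", "1.A.5")
in_scope <- df$x == 2010 & df$x2 == "Austria"

# Sum of the subcategories versus the stated parent value.
sub_total <- sum(df$y[is_sub & in_scope], na.rm = TRUE)
stated    <- df$y[df$x4 == "1.A" & in_scope]
print(isTRUE(all.equal(sub_total, stated)))

# For all years and countries at once, aggregate() sums y within
# each combination of x and x2.
all_sums <- aggregate(y ~ x + x2, data = df[is_sub, ], FUN = sum)
print(all_sums)
```

For the follow-up about running this for many countries and years, useful search keywords are aggregate, tapply and grouped summaries; they avoid writing a loop over every combination.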

Mike