1

I have a dataframe (sy2.1) with two columns (Vannstand2Positive and Vannstand2Negative) that both contain NA's as a result of mutating the columns from an existing column. I also have a third column (t) with time and date, and a fourth column (Kategori) with 4 categories. I want to find the percentage of each of the 4 categories with values from Vannstand2Positive and Vannstand2Negative.

This is the code I have used, in which I only use the column with positive values (Vannstand2Positive) in this code.

sy2.1 %>% 
  filter(year(t)==2005) %>% 
  group_by(Kategori, year(t)) %>% 
  dplyr::summarise(sum_vannstandPos = sum(Vannstand2Positive)) %>% 
  dplyr::mutate(percent_vannstandPos=sum_vannstandPos/sum(sum_vannstandPos)*100)


#    Kategori year(t) sum_VannstandPos percent_VannstandPos
# 1:  Liten    2005           NA                 NA
# 2: Moderat   2005          1107               100
# 1:  Stor     2005           609               100
# 2: Størst    2005          1107               100

There is clearly something wrong with the code, as the percentage gives 100% on all the categories, but I need the correct calculations of the percentage. I tried to also use it on the column with negative values (Vannstand2Negative), but I only got NA's on the table. But I know its fully possible to find percentages of negative values. Anyone with a code that can fix this?

By adding na.rm=T to the sum function, I got percentages of almost all groups except the first group:

#    Kategori year(t) sum_VannstandPos percent_VannstandPos
# 1:  Liten    2005           NA                 NA
# 2: Moderat   2005          1107               32.34005
# 1:  Stor     2005           609               17.79141
# 2: Størst    2005          1107               49.86854

Konrad Rudolph
  • 530,221
  • 131
  • 937
  • 1,214
Saron B
  • 39
  • 6
  • try dropping the grouping after `summarise` with `ungroup` – Quixotic22 Jan 05 '22 at 10:05
  • Tried adding that to my code, but that resulted in only NA's in the "percent_VannstandPos" column. I do have NA's in the column I am using values from, will that affect the output here? The NA's in that column replaces missing rows.. – Saron B Jan 05 '22 at 10:16
  • `na.rm = T` needs to be added the `sum` function to remove NA values from the calculation – Quixotic22 Jan 05 '22 at 10:36
  • Thanks for that! Did not know how to remove those NA's – Saron B Jan 05 '22 at 11:17

2 Answers2

0

Adding ungroup() as in the reproducab le example below.

iris %>% 
  group_by(Species) %>% 
  summarize(sum_Petal.Length = sum(Petal.Length, na.rm = TRUE)) %>% 
  ungroup() %>% 
  mutate(
    precentage = sum_Petal.Length / sum(sum_Petal.Length, na.rm = TRUE)
  )
Jochem
  • 3,295
  • 4
  • 30
  • 55
  • Tried adding that to my code, but that resulted in only NA's in the "percent_VannstandPos" column. I do have NA's in the column I am using values from, will that affect the output here? The NA's in that column replaces missing rows.. – Saron B Jan 05 '22 at 10:16
  • 1
    Would be easier if you can add a sample of your data with `dput(head(sy2.1, 10))` – AlexB Jan 05 '22 at 10:22
  • Okey will do! But how can I add the output here? New to this @AlexB – Saron B Jan 05 '22 at 10:31
  • Please edit your post by adding the output. You can take a look at https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – AlexB Jan 05 '22 at 10:44
  • Thanks! Need to learn it properly, I know adding dataframes showing the actual data really helps others answering my questions! – Saron B Jan 05 '22 at 11:18
  • @SaronB: I have added the code `na.rm = TRUE` to the solution. This will mitigate the NAs you may have in the data you try to sum. – Jochem Jan 05 '22 at 12:30
0

Use na.rm = TRUE to drop the NA values from the calculation in sum. You can use .groups = 'drop' in summarise to drop the groupings and since you are using year(t) more than once it is better to create a new column with the year information.

library(dplyr)
library(lubridate)

result <- sy2.1 %>% 
  mutate(year = year(t)) %>%
  filter(year==2005) %>% 
  group_by(Kategori, year) %>% 
  dplyr::summarise(sum_vannstandPos = sum(Vannstand2Positive, na.rm = TRUE), 
                  .groups = 'drop') %>% 
  dplyr::mutate(percent_vannstandPos= prop.table(sum_vannstandPos)*100)

result

prop.table(sum_vannstandPos) is another way to write sum_vannstandPos/sum(sum_vannstandPos).

Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
  • Thanks! This seemed to work!! Do you know how I could simply view that directly in a boxplot @RonakShah ? – Saron B Jan 05 '22 at 11:33