How can I find the percentage of different groups within a column?

Question

I have a dataframe (sy2.1) with two columns (Vannstand2Positive and Vannstand2Negative) that both contain NA's as a result of mutating the columns from an existing column. I also have a third column (t) with time and date, and a fourth column (Kategori) with 4 categories. I want to find the percentage of each of the 4 categories with values from Vannstand2Positive and Vannstand2Negative.

This is the code I have used, in which I only use the column with positive values (Vannstand2Positive) in this code.

sy2.1 %>% 
  filter(year(t)==2005) %>% 
  group_by(Kategori, year(t)) %>% 
  dplyr::summarise(sum_vannstandPos = sum(Vannstand2Positive)) %>% 
  dplyr::mutate(percent_vannstandPos=sum_vannstandPos/sum(sum_vannstandPos)*100)


#    Kategori year(t) sum_VannstandPos percent_VannstandPos
# 1:  Liten    2005           NA                 NA
# 2: Moderat   2005          1107               100
# 1:  Stor     2005           609               100
# 2: Størst    2005          1107               100

There is clearly something wrong with the code, as the percentage gives 100% on all the categories, but I need the correct calculations of the percentage. I tried to also use it on the column with negative values (Vannstand2Negative), but I only got NA's on the table. But I know its fully possible to find percentages of negative values. Anyone with a code that can fix this?

By adding na.rm=T to the sum function, I got percentages of almost all groups except the first group:

#    Kategori year(t) sum_VannstandPos percent_VannstandPos
# 1:  Liten    2005           NA                 NA
# 2: Moderat   2005          1107               32.34005
# 1:  Stor     2005           609               17.79141
# 2: Størst    2005          1107               49.86854

Tried adding that to my code, but that resulted in only NA's in the "percent_VannstandPos" column. I do have NA's in the column I am using values from, will that affect the output here? The NA's in that column replaces missing rows.. — Saron B, Jan 05 '22 at 10:16
`na.rm = T` needs to be added the `sum` function to remove NA values from the calculation — Quixotic22, Jan 05 '22 at 10:36

Jochem · Answer 1 · 2022-01-05T12:29:51.287

0

Adding ungroup() as in the reproducab le example below.

iris %>% 
  group_by(Species) %>% 
  summarize(sum_Petal.Length = sum(Petal.Length, na.rm = TRUE)) %>% 
  ungroup() %>% 
  mutate(
    precentage = sum_Petal.Length / sum(sum_Petal.Length, na.rm = TRUE)
  )

edited Jan 05 '22 at 12:29

answered Jan 05 '22 at 10:08

Jochem

3,295
4
30
55

Tried adding that to my code, but that resulted in only NA's in the "percent_VannstandPos" column. I do have NA's in the column I am using values from, will that affect the output here? The NA's in that column replaces missing rows.. – Saron B Jan 05 '22 at 10:16
1

Would be easier if you can add a sample of your data with `dput(head(sy2.1, 10))` – AlexB Jan 05 '22 at 10:22
Okey will do! But how can I add the output here? New to this @AlexB – Saron B Jan 05 '22 at 10:31
Please edit your post by adding the output. You can take a look at https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – AlexB Jan 05 '22 at 10:44
Thanks! Need to learn it properly, I know adding dataframes showing the actual data really helps others answering my questions! – Saron B Jan 05 '22 at 11:18
@SaronB: I have added the code `na.rm = TRUE` to the solution. This will mitigate the NAs you may have in the data you try to sum. – Jochem Jan 05 '22 at 12:30

score 0 · Accepted Answer · answered Jan 05 '22 at 10:44

Use na.rm = TRUE to drop the NA values from the calculation in sum. You can use .groups = 'drop' in summarise to drop the groupings and since you are using year(t) more than once it is better to create a new column with the year information.

library(dplyr)
library(lubridate)

result <- sy2.1 %>% 
  mutate(year = year(t)) %>%
  filter(year==2005) %>% 
  group_by(Kategori, year) %>% 
  dplyr::summarise(sum_vannstandPos = sum(Vannstand2Positive, na.rm = TRUE), 
                  .groups = 'drop') %>% 
  dplyr::mutate(percent_vannstandPos= prop.table(sum_vannstandPos)*100)

result

prop.table(sum_vannstandPos) is another way to write sum_vannstandPos/sum(sum_vannstandPos).

Thanks! This seemed to work!! Do you know how I could simply view that directly in a boxplot @RonakShah ? — Saron B, Jan 05 '22 at 11:33

How can I find the percentage of different groups within a column?

2 Answers2