I have a data-frame with several missing values (NAs), grouped into a number of groups (A,B,C,D,E,F) in a column named Group
. I am trying to analyse it in R.
I want to tabulate the number of rows/records belonging to each group, or better still the proportion of rows/records within each group (ie out of the total number of rows/records within that group), which have: >= 1 NA (i.e. at least 1 missing val), >=2 NAs, >=3 NAs, >=4 NAs, so forth (up to n NAs, which I would predefine).
I want to do it with just base R and dplyr
. I guess a solution would start with df %>% group_by(Group)
but I'm not sure where to go from there.
Any ideas?
Edit: Let's take iris for data:
iris0 <- iris
set.seed(101)
iris0[sample(150,40),1] <- NA
iris0[sample(150,40),2] <- NA
iris0[sample(150,40),3] <- NA
iris0[sample(150,40),4] <- NA
At this point I would like (after supplying say max_n=7
) a table somewhat like the following
1 NAs 2 NAs 3 NAs 4 NAs
1 setosa 30 13 3 0 0 0 0
2 versicolor 40 18 5 0 0 0 0
3 virginica 36 13 2 0 0 0 0
4 Total 106 44 10 0 0 0 0
5 (%) 70.67 29.33 6.67 0.00 0.00 0.00 0.00
Actually I came up with a method that does it using rowsums and lapply but I wonder if anyone can help with a more efficient method (using base R and dplyr) e.g. with aggregate or something. Unboxing the lapply results into a table like the above is also a major pain.
Also, better still would be something that divides the numbers by the total records in each class (for the proportion in each class).