
I have a data frame with several missing values (NAs), whose rows are grouped into a number of groups (A, B, C, D, E, F) by a column named Group. I am trying to analyse it in R.

I want to tabulate the number of rows/records belonging to each group, or better still the proportion of rows/records within each group (i.e. out of the total number of rows/records in that group), which have: >= 1 NA (at least one missing value), >= 2 NAs, >= 3 NAs, >= 4 NAs, and so forth, up to n NAs, where n is a number I would predefine.

I want to do it with just base R and dplyr. I guess a solution would start with df %>% group_by(Group) but I'm not sure where to go from there.

Any ideas?

Edit: Let's take iris for data:

iris0 <- iris
set.seed(101)
iris0[sample(150,40),1] <- NA
iris0[sample(150,40),2] <- NA
iris0[sample(150,40),3] <- NA
iris0[sample(150,40),4] <- NA

At this point I would like (after supplying say max_n=7) a table somewhat like the following

             >=1 NAs >=2 NAs >=3 NAs >=4 NAs >=5 NAs >=6 NAs >=7 NAs
1 setosa          30      13       3       0       0       0       0
2 versicolor      40      18       5       0       0       0       0
3 virginica       36      13       2       0       0       0       0
4 Total          106      44      10       0       0       0       0
5 (%)          70.67   29.33    6.67    0.00    0.00    0.00    0.00

Actually I came up with a method that does it using rowSums() and lapply(), but I wonder if anyone can help with a more efficient method (using base R and dplyr), e.g. with aggregate() or something. Unboxing the lapply() results into a table like the above is also a major pain.

Also, better still would be something that divides the numbers by the total records in each class (for the proportion in each class).

Mobeus Zoom

1 Answer


Since you did not initially provide data, I suppose you have the following kind of (simplified) problem, where

Z = data.frame(Group = c("A", "A", "B", "B", "C", "C", "C"), 
               value = c(1, NA, 1, 1, NA, NA, 1))
  Group value
1     A     1
2     A    NA
3     B     1
4     B     1
5     C    NA
6     C    NA
7     C     1

and now you want to count the proportion of entries for each group that are not NA, correct?

You can do this with

aggregate(value ~ Group, 
          data = Z, 
          function(x) {
                       sum(!is.na(x))/length(x)
                       }, 
          na.action = NULL)

and obtain the outcome

  Group     value
1     A 0.5000000
2     B 1.0000000
3     C 0.3333333 
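Since the question asks for a dplyr option as well, here is a sketch of the equivalent computation with group_by()/summarise() (using the same toy data frame Z as above):

```r
library(dplyr)

Z <- data.frame(Group = c("A", "A", "B", "B", "C", "C", "C"),
                value = c(1, NA, 1, 1, NA, NA, 1))

# proportion of non-NA values within each group
Z %>%
  group_by(Group) %>%
  summarise(value = sum(!is.na(value)) / n())
```

This returns one row per group with the same proportions as the aggregate() call (0.5, 1, 0.333...).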

EDIT:

Now that the data has been provided, we need to tweak the code slightly to arrive at the result.

First do

Test = iris0 %>% 
  dplyr::mutate(NAS = rowSums(is.na(.))) %>% 
  dplyr::group_by(Species) 

and then we do

aggregate(NAS ~ Species, 
          data = Test, 
          function(x) {
            table(x)
          }
)

to arrive at

     Species NAS.0 NAS.1 NAS.2 NAS.3
1     setosa    20    17    10     3
2 versicolor    10    22    13     5
3  virginica    14    23    11     2
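One caveat (it comes up in the comments below): table(x) drops counts of zero, so the number of columns depends on which NA counts actually occur in the data. To force a fixed set of columns from 0 up to a chosen max_n, zeros included, one could tabulate a factor with explicit levels. A sketch, rebuilding iris0 and the NAS column from the question, with a hypothetical max_n = 7:

```r
# rebuild the example data from the question
iris0 <- iris
set.seed(101)
for (j in 1:4) iris0[sample(150, 40), j] <- NA

# number of NAs per row (Species itself has no NAs)
iris0$NAS <- rowSums(is.na(iris0[, 1:4]))

max_n <- 7  # user-chosen maximum number of NAs to report

# factor() with explicit levels keeps zero counts in the table
aggregate(NAS ~ Species,
          data = iris0,
          function(x) table(factor(x, levels = 0:max_n)))
```

Every group now yields columns NAS.0 through NAS.7, with 0 where no row has that many NAs.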

EDIT II:

To address your further questions in the comments:

(1) You can index and extract the usual way, e.g.

Test[Test$NAS == 3 & Test$Species == "virginica", ]

(2) We can also get a self-made cumulative table for the example by means of

Z = aggregate(NAS ~ Species, 
      data = Test, 
      function(x) {
        table(x)
      }
)
(t(apply(Z[, 2:dim(Z)[2]], 1, cumsum)) - 50) * (-1)

with the result

      0  1 2 3
[1,] 30 13 3 0
[2,] 40 18 5 0
[3,] 36 13 2 0

This is a quick fix, of course, but it works. Note that in the general case you need to replace 50 with the vector of group counts.
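For the general case, here is a sketch that replaces the hard-coded 50 with the per-group row counts, and uses fixed factor levels so that every group yields the same columns (it assumes the groups come out of aggregate() and table() in the same order, which they do for a factor like Species):

```r
# rebuild the data and the per-group NA tabulation
iris0 <- iris
set.seed(101)
for (j in 1:4) iris0[sample(150, 40), j] <- NA
iris0$NAS <- rowSums(is.na(iris0[, 1:4]))

# fixed levels guarantee equal-length tables across groups
Z <- aggregate(NAS ~ Species, data = iris0,
               function(x) table(factor(x, levels = 0:max(iris0$NAS))))

counts <- as.vector(table(iris0$Species))  # rows per group (50 50 50 here)

# group size minus the cumulative count of rows with <= k NAs
# gives the number of rows with more than k NAs, i.e. >= k+1 NAs
M <- counts - t(apply(Z$NAS, 1, cumsum))
rownames(M) <- as.character(Z$Species)
M
```

The subtraction works because an n-vector minus an n-row matrix recycles column-wise, so each row i has counts[i] subtracted throughout.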

(3) It stops at 3 NAs because there are simply no more columns to hold more NAs.

Taufi
  • I'm afraid not. My data has several columns. I want to count number of **rows** per class, that (within that row) have at least **n** NAs or more. I will update the post with data. – Mobeus Zoom May 13 '20 at 15:07
  • I've added a full reproducible example – Mobeus Zoom May 13 '20 at 15:21
  • I just updated my answer in light of the new specifications you gave. – Taufi May 13 '20 at 16:07
  • Thank you. This is a neat answer but there are a few problems with it: (1) it's not in a format I can work with (how do I index to extract, say, the number of records in class B that have 2 NAs?). (2) I'd like a cumulative count (i.e. **greater than** 1 NA, 2 NAs, ...). (3) How did the table decide to stop at 3 NAs? I wanted to supply this myself... trying it on my data I note that it doesn't stop when all NAs stop (but rather before that). – Mobeus Zoom May 13 '20 at 16:27
  • I just updated the answer to respond to your caveats. – Taufi May 13 '20 at 16:58
  • Thank you. When I try to index in the way you describe, I get ```character(0)```. I also found - on a much larger dataset with many more columns - that the NAs columns stop far before we reach no of columns (or even max no of NAs in a row). – Mobeus Zoom May 13 '20 at 17:16
  • The reason for which I'm quite sure is that the result is actually a list with number of NAs and then number of records which have that number of NAs, for each class. Because this list doesn't record 0s, it goes straight from say '3'=6 which is 6 records with 3 NAs, to say '5'=1 which indicates 1 record with 5 NAs, with no mention of '4'. So it's complete but not usable like this, since for neat tabulation we need the 0s. – Mobeus Zoom May 13 '20 at 17:28
  • Sorry, but I do not follow. It shouldn't be problematic that there is no row with 4 NAs as it would record it with '4' = 0. Also, I do not see what list you mean and what kind of 0s are missing. – Taufi May 13 '20 at 17:43
  • Unfortunately it doesn't record it with '4'=0 (or at all). – Mobeus Zoom May 13 '20 at 17:45