Handling NAs when mapping several dataframe columns into their percentile values in R

Question

> dput(zed)
    structure(list(col1 = c(0, 0.236258076229343, 0.43840483531742, 
    0, NaN, 0.198838380845137, 0.0754815882584196, 0.10176020461209, 
    0.045933014354067, 0.256237616143739, 0.0880658828009711, 0.117285153415946, 
    0.127902400629673, 0, 0.117682083253069, 0.114542851298834, 0.0584035686594367, 
    0.123456790123457, 0.196817420435511, 0.0369541251378046), col2 = c(0.121951219512195, 
    0.17979731938542, 0.305944055944056, 0, NaN, 0.239463601532567, 
    0.0625521267723103, 0.161729656111679, 0.0612745098039216, 0.22002200220022, 
    0.135608048993876, NaN, 0, 0, 0.0934420659191301, 0.140091696383087, 
    0.141872719902716, 0, 0.176720075400566, 0.253924284395199), 
        col3 = c(0.227540305157712, 0.264931804641559, 0.190018713264226, 
        0.564015792442188, NaN, 0.116857208286359, 0.136034761917893, 
        0.137370134394451, 0.227357158778513, 0.215714919326088, 
        0.240671647524362, 0.107512520868114, 0.0681162324911809, 
        0.195274360476469, NaN, 0.208033156719459, 0.199848016844409, 
        0.140383517621937, 0.202430694674985, 0.0927417625979096)), row.names = c(NA, 
    -20L), class = c("tbl_df", "tbl", "data.frame"))

> zed
# A tibble: 20 x 3
       col1     col2     col3
      <dbl>    <dbl>    <dbl>
 1   0        0.122    0.228 
 2   0.236    0.180    0.265 
 3   0.438    0.306    0.190 
 4   0        0        0.564 
 5 NaN      NaN      NaN     
 6   0.199    0.239    0.117 
 7   0.0755   0.0626   0.136 
 8   0.102    0.162    0.137 
 9   0.0459   0.0613   0.227 
10   0.256    0.220    0.216 
11   0.0881   0.136    0.241 
12   0.117  NaN        0.108 
13   0.128    0        0.0681
14   0        0        0.195 
15   0.118    0.0934 NaN     
16   0.115    0.140    0.208 
17   0.0584   0.142    0.200 
18   0.123    0        0.140 
19   0.197    0.177    0.202 
20   0.0370   0.254    0.0927

I have the following dataframe, which has multiple columns (col1, col2, col3) for which I need to convert into percentiles (rounded to the nearest integer, so one of 1:100). My preference - and what I assume is easiest - is to add 3 additional columns col1pctile, col2pctile, col3pctile that maps each respective column to their percentile value (within that column).

Using the fmsb::percentile() function on a single column returns an error due to the presence of NAs.

> fmsb::percentile(zed$col1)
Error in quantile.default(dat, probs = seq(0, 1, by = 0.01), type = 7) : 
  missing values and NaN's not allowed if 'na.rm' is FALSE

Although the example dataframe above only has 20 rows, my actual dataframe is many more rows than just 20, and having percentile values actually makes sense for my use-case (whereas percentiles wouldn't make sense for only 20 rows).

I will edit this post shortly with my current attempts, which aren't working as I'd hope. Any help with this would be greatly appreciated!

You could probably take advantage of something like `is.finite()` to subset only the finite values in each column, OR follow some of the advice here: https://stackoverflow.com/questions/21219447/calculating-percentile-of-dataset-column — Chase, Jan 02 '19 at 01:26
So what do you want to do with `NaN` values ? Ignore them from calculation? and return 0 for their corresponding percentile value or keep them as it is? — Ronak Shah, Jan 02 '19 at 02:13
@RonakShah preferably they have an NaN percentile value, so ignored from calculation — Canovice, Jan 03 '19 at 15:42

score 1 · Accepted Answer · answered Jan 02 '19 at 02:18

There are two challenges when using the percentile function from the fmsb. First, it cannot handle missing values. Second, it cannot handle zero.

Here is the code of the percentile function.

library(dplyr)
library(fmsb)

percentile
# function (dat) 
# {
#   pt1 <- quantile(dat, probs = seq(0, 1, by = 0.01), type = 7)
#   pt2 <- unique(as.data.frame(pt1), fromLast = TRUE)
#   pt3 <- rownames(pt2)
#   pt4 <- as.integer(strsplit(pt3, "%"))
#   datp <- pt4[as.integer(cut(dat, c(0, pt2$pt1), labels = 1:length(pt3)))]
#   return(datp)
# }
# <bytecode: 0x0000000016c498b0>
#   <environment: namespace:fmsb>

As you can see, there are no ways to specify the na.rm argument to the quantile function. However, simply set na.rm = TRUE to quantile function will not work because we would like the function to return NA when the input numbers are NA.

In addition, when providing a vector with zero, the function returns error as follows.

percentile(0:5)
# Error in cut.default(dat, c(0, pt2$pt1), labels = 1:length(pt3)) : 
#  'breaks' are not unique

My suggestion is to re-write the function to be able to return NA for NA input values, and add a small numbers for zero. Here is my modification for the function. I called it percentile_narm_zero.

percentile_narm_zero <- function(dat, small = 0.0000000000001){

  # Create a data frame with the numeric values and index
  dat2 <- data.frame(index = 1:length(dat), dat = dat)
  # Remove NA
  dat3 <- dat2[ !is.na(dat2$dat), ]
  # Add a small number to 0
  dat3$dat <- ifelse(dat3$dat == 0, dat3$dat + small, dat3$dat)

  # This part is the same as the percentile function
  pt1 <- quantile(dat3$dat, probs = seq(0, 1, by = 0.01), type = 7)
  pt2 <- unique(as.data.frame(pt1), fromLast = TRUE)
  pt3 <- rownames(pt2)
  pt4 <- as.integer(strsplit(pt3, "%"))
  datp <- pt4[as.integer(cut(dat3$dat, c(0, pt2$pt1)), labels = 1:length(pt3))]

  # Merge datp back to dat2
  dat3$datp <- datp
  dat4 <- merge(dat2, dat3, by = "index", all = TRUE)

  return(dat4$datp)
}

Now we can apply this function to all columns in zed using mutate_all.

zed2 <- zed %>% mutate_all(funs(pctile = percentile_narm_zero(.)))
# A tibble: 20 x 6
#       col1     col2     col3 col1_pctile col2_pctile col3_pctile
#      <dbl>    <dbl>    <dbl>       <int>       <int>       <int>
#  1   0        0.122    0.228           11          42          83
#  2   0.236    0.180    0.265           89          77          95
#  3   0.438    0.306    0.190          100         100          42
#  4   0        0        0.564           11          17         100
#  5 NaN      NaN      NaN               NA          NA          NA
#  6   0.199    0.239    0.117           84          89          18
#  7   0.0755   0.0626   0.136           34          30          24
#  8   0.102    0.162    0.137           45          65          30
#  9   0.0459   0.0613   0.227           23          24          77
# 10   0.256    0.220    0.216           95          83          71
# 11   0.0881   0.136    0.241           39          48          89
# 12   0.117  NaN        0.108           56          NA          12
# 13   0.128    0        0.0681          73          17           0
# 14   0        0        0.195           11          17          48
# 15   0.118    0.0934 NaN               62          36          NA
# 16   0.115    0.140    0.208           50          53          65
# 17   0.0584   0.142    0.200           28          59          53
# 18   0.123    0        0.140           67          17          36
# 19   0.197    0.177    0.202           78          71          59
# 20   0.0370   0.254    0.0927          17          95           6

wonderful. i did realize that `percentile()` did not handle zeros and so i was already (in another example) adding 0.1 to all values in the column. This is a great answer, thanks. — Canovice, Jan 03 '19 at 15:44

score 0 · Answer 2 · answered Jan 02 '19 at 02:19

First define a function to calculate percentile group as:

percentile_group <- function(x)
{
  y <- as.numeric(x) %>% discard(is.na)
  qn <- quantile(y, probs = seq(0, 1, by= 0.1), na.rm = TRUE) %>% unique()
  grp <- cut(x, breaks=qn, include.lowest=T, labels=F)
  #return(qn)
  return(grp)
}

Now use the function in a mutate statement as

 mutate_if(zen, is.numeric, funs(pctile = percentile_group))

The output is:

# A tibble: 20 x 6
col1     col2     col3 col1_pctile col2_pctile col3_pctile
<dbl>    <dbl>    <dbl>       <int>       <int>       <int>
  1   0        0.122    0.228            1           4           9
2   0.236    0.180    0.265            8           7          10
3   0.438    0.306    0.190            9           9           5
4   0        0        0.564            1           1          10
5 NaN      NaN      NaN               NA          NA          NA
6   0.199    0.239    0.117            8           8           2
7   0.0755   0.0626   0.136            3           2           3
8   0.102    0.162    0.137            4           6           3
9   0.0459   0.0613   0.227            2           2           8
10   0.256    0.220    0.216            9           8           8
11   0.0881   0.136    0.241            3           4           9
12   0.117  NaN        0.108            5          NA           2
13   0.128    0        0.0681           7           1           1
14   0        0        0.195            1           1           5
15   0.118    0.0934 NaN                6           3          NA
16   0.115    0.140    0.208            4           5           7
17   0.0584   0.142    0.200            2           5           6
18   0.123    0        0.140            6           1           4
19   0.197    0.177    0.202            7           7           6
20   0.0370   0.254    0.0927           1           9           1

Handling NAs when mapping several dataframe columns into their percentile values in R

2 Answers2