0

I have a very large data frame (think 30-50 million records), hence I'm using data.table for this problem. I am much more familiar with dplyr than with data.table.

Let's consider the following small example. Note that there are many more columns in my actual data set.

library(data.table)
library(magrittr)
library(stringi)

set.seed(42)

format_pct <- function(x){
  paste0(formatC(x * 100, digits = 1, format = 'f'), "%")
}

df <- data.frame(x = c(1, NA, 2, 4, NA),
                 y = c(0, 1, NA, 2, 5),
                 huge_numsf = sample.int(500000:1000000, size = 5),
                 huge_numsg = sample.int(500000:1000000, size = 5),
                 percent_a = format_pct(runif(5)),
                 percent_b = format_pct(runif(5)))

> df
   x  y huge_numsf huge_numsg percent_a percent_b
1  1  0     457404     259548     45.8%     94.0%
2 NA  1     468537     368294     71.9%     97.8%
3  2 NA     143070      67334     93.5%     11.7%
4  4  2     415222     328495     25.5%     47.5%
5 NA  5     320871     352530     46.2%     56.0%

I would like to apply prettyNum() to all columns except x, y, and columns with the string 'percent'.

If this data frame weren't large, I would do

df[,colnames(df)[
  !(colnames(df) %in% c("x", "y", colnames(df)[stri_detect_fixed(colnames(df), "percent")]))
  ]] <-  
  apply(X = df[,colnames(df)[
    !(colnames(df) %in% c("x", "y", colnames(df)[stri_detect_fixed(colnames(df), "percent")]))
    ]],
    MARGIN = 2,
    FUN = prettyNum,
    big.mark = ",")

> df
   x  y huge_numsf huge_numsg percent_a percent_b
1  1  0    457,404    259,548     45.8%     94.0%
2 NA  1    468,537    368,294     71.9%     97.8%
3  2 NA    143,070     67,334     93.5%     11.7%
4  4  2    415,222    328,495     25.5%     47.5%
5 NA  5    320,871    352,530     46.2%     56.0%

Assume now that we've made df a data.table; i.e.:

df <- data.frame(x = c(1, NA, 2, 4, NA),
                 y = c(0, 1, NA, 2, 5),
                 huge_numsf = sample.int(500000:1000000, size = 5),
                 huge_numsg = sample.int(500000:1000000, size = 5),
                 percent_a = format_pct(runif(5)),
                 percent_b = format_pct(runif(5))) %>% 
  data.table(.)

Is there a way to do the above using data.table syntax?

Clarinetist
  • 1,097
  • 18
  • 46

1 Answers1

1

Here is a data.table solution. Full disclosure credit should go to this previous post which I followed closely. How to apply same function to every specified column in a data.table

cols<-c("x", "y", colnames(df)[stri_detect_fixed(colnames(df), "percent")])
cols <- setdiff(colnames(df), cols)
df[ , (cols) := lapply(.SD, prettyNum, big.mark = ","), .SDcols = cols]
CCurtis
  • 1,770
  • 3
  • 15
  • 25
  • Maybe I did something wrong... but this doesn't seem to have done anything? – Clarinetist Dec 01 '17 at 20:42
  • Oh, you're missing the following: `cols <- setdiff(colnames(df), cols)` - which needs to before the final line. Also, you'd need `lapply(.SD, prettyNum, big.mark = ",")`. Once you fix these, I can award you points. – Clarinetist Dec 01 '17 at 20:46
  • Made the changes. – CCurtis Dec 01 '17 at 20:52
  • Could also specify `cols<-colnames(df)[!stri_detect_fixed(colnames(df), "percent")][-1:-2]` but only if x and y are always your first 2 columns. Sure there are a million other ways to do it. – CCurtis Dec 01 '17 at 21:18