3

I am creating Benford plots for all the numeric variables in my dataset. https://en.wikipedia.org/wiki/Benford%27s_law

Running a single variable

#install.packages("benford.analysis")
library(benford.analysis)
plot(benford(iris$Sepal.Length))

Looks great. And the legend says "Dataset: iris$Sepal.Length", perfect!.

Benford 1

Using apply to run 4 variables,

apply(iris[1:4], 2, function(x) plot(benford(x)))

Creates four plots, however, each plot's legend says "Dataset: x"

Benford 2

I attempted to use a for loop,

for (i in colnames(iris[1:4])){
  plot(benford(iris[[i]]))
}

This creates four plots, but now the legends says "Dataset: iris[[i]]". And I would like the name of the variable on each chart.

Benford 3

I tried a different loop, hoping to get titles with an evaluated parsed string like "iris$Sepal.Length":

for (i in colnames(iris[1:4])){
  plot(benford(eval(parse(text=paste0("iris$", i)))))
}

But now the legend says "Dataset: eval(parse(text=paste0("iris$", i)))".

Benford 4

AND, Now I've run into the infamous eval(parse(text=paste0( (eg: How to "eval" results returned by "paste0"? and R: eval(parse(...)) is often suboptimal )

I would like labels such as "Dataset: iris$Sepal.Length" or "Dataset: Sepal.Length". How can I create multiple plots with meaningfully variable names in the legend?

Carlos Cinelli
  • 11,354
  • 9
  • 43
  • 66
M.Viking
  • 5,067
  • 4
  • 17
  • 33

2 Answers2

1

This is happening because of the first line within the benford function=:

benford <- function(data, number.of.digits = 2, sign = "positive", discrete=TRUE, round=3){

  data.name <- as.character(deparse(substitute(data)))

Source: https://github.com/cran/benford.analysis/blob/master/R/functions-new.R

data.name is then used to name your graph. Whatever variable name or expression you pass to the function will unfortunately be caught by the deparse(substitute()) call, and will be used as the name for your graph.


One short-term solution is to copy and rewrite the function:

#install.packages("benford.analysis")
library(benford.analysis)
#install.packages("data.table")
library(data.table) # needed for function

# load hidden functions into namespace - needed for function
r <- unclass(lsf.str(envir = asNamespace("benford.analysis"), all = T))
for(name in r) eval(parse(text=paste0(name, '<-benford.analysis:::', name)))


benford_rev <- function{} # see below

for (i in colnames(iris[1:4])){
   plot(benford_rev(iris[[i]], data.name = i))
}

enter image description here

enter image description here

This has negative side effects of:

  • Not being maintainable with package revisions
  • Fills your GlobalEnv with normally hidden functions in the package

So hopefully someone can propose a better way!


benford_rev <- function(data, number.of.digits = 2, sign = "positive", discrete=TRUE, round=3, data.name = as.character(deparse(substitute(data)))){ # changed

 # removed line

  benford.digits <- generate.benford.digits(number.of.digits)

  benford.dist <- generate.benford.distribution(benford.digits)

  empirical.distribution <- generate.empirical.distribution(data, number.of.digits,sign, second.order = FALSE, benford.digits)

  n <- length(empirical.distribution$data)

  second.order <- generate.empirical.distribution(data, number.of.digits,sign, second.order = TRUE, benford.digits, discrete = discrete, round = round)

  n.second.order <- length(second.order$data)

  benford.dist.freq <- benford.dist*n

  ## calculating useful summaries and differences
  difference <- empirical.distribution$dist.freq - benford.dist.freq

  squared.diff <- ((empirical.distribution$dist.freq - benford.dist.freq)^2)/benford.dist.freq

  absolute.diff <- abs(empirical.distribution$dist.freq - benford.dist.freq)

  ### chi-squared test
  chisq.bfd <- chisq.test.bfd(squared.diff, data.name)

  ### MAD
  mean.abs.dev <- sum(abs(empirical.distribution$dist - benford.dist)/(length(benford.dist)))

  if (number.of.digits > 3) {
    MAD.conformity <- NA
  } else {
    digits.used <- c("First Digit", "First-Two Digits", "First-Three Digits")[number.of.digits]  
    MAD.conformity <- MAD.conformity(MAD = mean.abs.dev, digits.used)$conformity
  }





  ### Summation
  summation <- generate.summation(benford.digits,empirical.distribution$data, empirical.distribution$data.digits)
  abs.excess.summation <- abs(summation - mean(summation))

  ### Mantissa
  mantissa <- extract.mantissa(empirical.distribution$data)
  mean.mantissa <- mean(mantissa)
  var.mantissa <- var(mantissa)
  ek.mantissa <- excess.kurtosis(mantissa)
  sk.mantissa <- skewness(mantissa)

  ### Mantissa Arc Test
  mat.bfd <- mantissa.arc.test(mantissa, data.name)

  ### Distortion Factor
  distortion.factor <- DF(empirical.distribution$data)  

  ## recovering the lines of the numbers
  if (sign == "positive") lines <- which(data > 0 & !is.na(data))
  if (sign == "negative") lines <- which(data < 0 & !is.na(data))
  if (sign == "both")     lines <- which(data != 0 & !is.na(data))
  #lines <- which(data %in% empirical.distribution$data)

  ## output
  output <- list(info = list(data.name = data.name,
                             n = n,
                             n.second.order = n.second.order,
                             number.of.digits = number.of.digits),

                 data = data.table(lines.used = lines,
                                   data.used = empirical.distribution$data,
                                   data.mantissa = mantissa,
                                   data.digits = empirical.distribution$data.digits),

                 s.o.data = data.table(second.order = second.order$data,
                                       data.second.order.digits = second.order$data.digits),

                 bfd = data.table(digits = benford.digits,
                                  data.dist = empirical.distribution$dist,
                                  data.second.order.dist = second.order$dist,
                                  benford.dist = benford.dist,
                                  data.second.order.dist.freq = second.order$dist.freq,
                                  data.dist.freq = empirical.distribution$dist.freq,
                                  benford.dist.freq = benford.dist.freq,
                                  benford.so.dist.freq = benford.dist*n.second.order,
                                  data.summation = summation,
                                  abs.excess.summation = abs.excess.summation,
                                  difference = difference,
                                  squared.diff = squared.diff,
                                  absolute.diff = absolute.diff),

                 mantissa = data.table(statistic = c("Mean Mantissa", 
                                                     "Var Mantissa", 
                                                     "Ex. Kurtosis Mantissa",
                                                     "Skewness Mantissa"),
                                       values = c(mean.mantissa = mean.mantissa,
                                                  var.mantissa = var.mantissa,
                                                  ek.mantissa = ek.mantissa,
                                                  sk.mantissa = sk.mantissa)),
                 MAD = mean.abs.dev,

                 MAD.conformity = MAD.conformity,

                 distortion.factor = distortion.factor,

                 stats = list(chisq = chisq.bfd,
                              mantissa.arc.test = mat.bfd)
  )

  class(output) <- "Benford"

  return(output)

}
Chris
  • 6,302
  • 1
  • 27
  • 54
  • I have continued looking at functions; stacking and nesting and alternating additional `deparse` `substitute` `eval` `parse`. And my history is full of other commands like `parseq` `noquote` `with` `assign` `quote` `enquote` `getParseData`. – M.Viking Jul 10 '19 at 13:41
1

I have just updated the package (GitHub version) to allow for a user supplied name.

Now the function has a new parameter called data.name in which you can provide a character vector with the name of the data and override the default. Thus, for your example you can simply run the following code.

First install the GitHub version (I will submit this version to CRAN soon).

devtools::install_github("carloscinelli/benford.analysis") # install new version

Now you can provide the name of the data inside the for loop:

library(benford.analysis)

for (i in colnames(iris[1:4])){
  plot(benford(iris[[i]], data.name = i))
}

And all the plots will have the correct naming as you wish (below).

Created on 2019-08-10 by the reprex package (v0.2.1)

Carlos Cinelli
  • 11,354
  • 9
  • 43
  • 66