0

I have a vector with many numbers (> 1E9 elements) and want to derive the numeric precision (number of digits in a number) and numeric scale (the number of digits to the right of the decimal point in a number).

How can I do this very fast (vectorized)?

There is a question with a partial answer (how to return number of decimal places in R), but that solution is neither fast (vectorized) nor does it calculate the numeric precision.

Example:

# small example vector with numeric data
x <- c(7654321, 54321.1234, 321.123, 321.123456789)

> numeric.precision(x)  # implementation is the answer
[1]  7  9  6 12

> numeric.scale(x)      # implementation is the answer
[1] 0 4 3 9

Optional "sugar" (added later to this question - thx to @thc and @gregor):

How can I avoid over-counting the number of digits caused by the imprecise way computers store numbers internally (e.g. as floats)?

> x = 54321.1234
> as.character(x)
[1] "54321.1234"
> print(x, digits = 22)
[1] 54321.12339999999676365
Gregor Thomas
R Yoda
  • Your input should be character strings, not numerics. The reason is that floats are allowed to be slightly imprecise. For example: a=0.15+0.15; b=0.1+0.2; a==b is false. – thc Feb 08 '17 at 19:50
  • Or, more relevant to your example data: `x = 54321.1234; print(x, digits = 22)` – Gregor Thomas Feb 08 '17 at 19:56
  • @thc Very good point! I have to mention an important precondition: Since I read my data from a CSV file into my `data.table` I can (almost ;-) guarantee that I have a limited number of digits (even though an internal conversion into a float can destroy my precondition ;-) – R Yoda Feb 08 '17 at 19:58
  • A good way to get the number of digits to the *left* of the decimal point is `trunc(log10(abs(x))) + 1`. I leave it here in case it's useful in full answers. I'm not sure how it would compare speed-wise with a conversion to character. – Gregor Thomas Feb 08 '17 at 20:05
  • If you really can "guarantee that I have a limited number of digits", then use lmo's `nchar(sub())` method on `format(x, digits = maxp, scientific = FALSE)` where `maxp` is the maximum precision you expect in your data. – Gregor Thomas Feb 08 '17 at 20:11
  • @Gregor The log10 solution is really quite fast! Any idea how to calculate the decimal places using a similar non-string-based algorithm? – R Yoda Feb 08 '17 at 20:19
  • Damn, R is very strict, I am afraid the `digits` parameter does not work as hoped: `format(54321.1234, digits = 6, scientific = FALSE)` results in `[1] "54321.1"`, `format(54321.1234, digits = 22, scientific = FALSE)` in `[1] "54321.12339999999676365"`. Both not helpful in case of a vector of numbers. – R Yoda Feb 08 '17 at 20:41
  • Yes, but you said you could guarantee a *limited* number of digits. We already demonstrated that 22 is excessive - `format(x, digits = 14, scientific = F, trim = T, drop0trailing = T)` works for your example and has a bit of cushion. You can put it up to 16 without problems in this example. – Gregor Thomas Feb 08 '17 at 20:47
  • @Gregor Now I understand why you said "guarantee a limited number", thx :-) 16 is big enough in my case. – R Yoda Feb 08 '17 at 20:57
  • See also http://stackoverflow.com/q/2377174/4468078 – R Yoda Feb 09 '17 at 21:52

3 Answers

3

Here is a base R method to start with. It is bound to be too slow, but at least it calculates the desired results.

# precision
nchar(sub(".", "", x, fixed=TRUE))
[1]  7  9  6 12

# scale
nchar(sub("\\d+\\.?(.*)$", "\\1", x))
[1] 0 4 3 9

For this method, I'd recommend using the colClasses argument of data.table's fread to read the values as character strings, which avoids the precision issues of a conversion to numeric in the first place:

x <- unlist(fread("7654321
54321.1234
321.123
321.123456789", colClasses="character"), use.names=FALSE)

It may be necessary to convert the vector to numeric during input, as mentioned in the comments, for example when some of the input values are in scientific notation in the text file. In that case, a formatting statement or options(scipen=999) may be needed to force the conversion from scientific to standard decimal notation, as noted in this answer.
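
If the data does end up as numeric, one possible way to combine the ideas from the comments is sketched below. This is only a sketch under stated assumptions: the helper names digit.strings, numeric.precision and numeric.scale are made up for illustration, the limit of 15 significant digits is an assumed cap that hides binary representation noise, and the values are assumed to be small and large enough that "%.15g" does not switch to scientific notation (the helper stops otherwise). Like the method above, it counts a leading zero such as the one in 0.5 as a digit.

```r
# Hypothetical helper (not from any package): one string per element,
# rounded to at most `digits` significant digits.
digit.strings <- function(x, digits = 15) {
  # "%.15g" drops trailing zeros, so 54321.1234 is not rendered as
  # 54321.12339999999676365; the sign is removed via abs().
  s <- sprintf(paste0("%.", digits, "g"), abs(x))
  s[is.na(x)] <- NA_character_
  # Assumption: no element needs scientific notation.
  if (any(grepl("e", s, fixed = TRUE), na.rm = TRUE)) {
    stop("some values would be formatted in scientific notation")
  }
  s
}

# Total number of digits (decimal point removed), NA for missing values.
numeric.precision <- function(x, digits = 15) {
  n <- nchar(sub(".", "", digit.strings(x, digits), fixed = TRUE))
  n[is.na(x)] <- NA_integer_
  n
}

# Number of digits to the right of the decimal point, NA for missing values.
numeric.scale <- function(x, digits = 15) {
  n <- nchar(sub("^[0-9]*\\.?", "", digit.strings(x, digits)))
  n[is.na(x)] <- NA_integer_
  n
}

x <- c(7654321, 54321.1234, 1e11, 1e4, NA)
numeric.precision(x)
numeric.scale(x)
```

All steps are vectorized, but they still go through character strings, so this will probably not be the fastest option for vectors with more than 1E9 elements.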

Community
lmo
  • You should probably use `format(x,scientific=FALSE,...)` with other arguments as necessary to prevent cases like `nchar(sub(".", "", as.character(10000000000), fixed=TRUE)) == 5` – A. Webb Feb 08 '17 at 20:03
  • @A.Webb Thanks for the comment. I've added an alternative to numeric coercion which may be preferable in terms of the numerical precision issue. – lmo Feb 08 '17 at 20:25
  • @RichScriven Ah yes. Thanks. I often forget that the regex functions have this nice feature. – lmo Feb 08 '17 at 20:46
  • @lmo I am going to accept your answer as the best solution. Would you mind adding `options(scipen=999)` (from http://stackoverflow.com/a/5352328/4468078) to disable the scientific notation, which causes under-counting of the precision? My test vector for this is: `x <- c(7654321, 54321.1234, 321.123, 321.123456789, 54321.1234, 100000000000, 1E4)` – R Yoda Feb 08 '17 at 21:34
  • Please note that `NA`s are currently counted wrong (precision 2 and scale 2). If the vector contained only 1 digit numbers and NAs the result would be (slightly) wrong. – R Yoda Feb 08 '17 at 22:10
1

Here is the idea for a math-based version (faster than manipulating character strings). You can wrap this in two functions, scale and precision, where the precision function calls the scale function.

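for (i in seq_along(x)) {
     # scale: shift the decimal point right until no fractional part remains
     after <- 0
     while (x[i] * 10^after != round(x[i] * 10^after)) {
         after <- after + 1
     }
     cat(sprintf("Scale: %s\n", after))
     # number of digits to the left of the decimal point
     before <- floor(log10(abs(x[i]))) + 1
     cat(sprintf("Precision: %s\n", before + after))
 }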

Result:

Scale: 0
Precision: 7
Scale: 4
Precision: 9
Scale: 3
Precision: 6
Scale: 9
Precision: 12
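
The same idea can probably be vectorized by looping over the candidate scales instead of over the elements. The following is only a sketch: the function name numeric.scale.math and the cap max.scale = 15 are assumptions made for illustration, and the result still depends on floating-point comparisons, so values that never compare equal within the cap are reported as NA.

```r
# Hypothetical vectorized variant of the loop above: at most
# max.scale + 1 passes over the vector, regardless of its length.
numeric.scale.math <- function(x, max.scale = 15L) {
  scale <- rep(NA_integer_, length(x))
  todo <- which(!is.na(x))                 # indices whose scale is still unknown
  for (after in 0:max.scale) {
    if (length(todo) == 0L) break
    shifted <- x[todo] * 10^after
    done <- shifted == round(shifted)      # no fractional part left
    scale[todo[done]] <- after
    todo <- todo[!done]                    # keep only the unresolved elements
  }
  scale
}

# Digits to the left of the decimal point plus the scale
# (assumes non-zero values, because log10(0) is -Inf).
numeric.precision.math <- function(x, max.scale = 15L) {
  floor(log10(abs(x))) + 1 + numeric.scale.math(x, max.scale)
}

x <- c(7654321, 54321.1234, 0.25, NA)
numeric.scale.math(x)
numeric.precision.math(x)
```

Each pass only touches the elements that are not yet resolved, so the work shrinks quickly when most values have few decimal places.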
Nejc Galof
  • Clever algorithm (estimating the number of digits of the fractional part is really difficult). From a practical point of view I think this solution will be too slow for me since it does not support vectorization (but loops over all elements in the vector). – R Yoda Feb 08 '17 at 20:49
0

Just to consolidate all comments and answers into one ready-to-use solution that also considers different countries (locales) and NA, I am posting this as an answer (please give credit to @lmo, @Gregor et al.).

Edit (Feb 09, 2017): Added the SQL.precision as return value since it may be different from the mathematical precision.

#' Calculates the biggest precision and scale that occurs in a numeric vector
#'
#' The scale of a numeric is the count of decimal digits in the fractional part (to the right of the decimal point).
#' The precision of a numeric is the total count of significant digits in the whole number,
#' that is, the number of digits to both sides of the decimal point. 
#'
#' To create a suitable numeric data type in a SQL data base use the returned \code{SQL.precision} which
#' is defined by \code{max(precision, non.fractional.precision + scale)}.
#'
#' @param x numeric vector
#'
#' @return A list with four elements:
#'         precision (total number of significant digits in the whole number),
#'         scale (number of digits in the fractional part),
#'         non.fractional.precision (number of digits to the left of the decimal point),
#'         SQL.precision (see the description above).
#'
#' @details NA will be counted as precision 1 and scale 0!
#'
#' @examples
#'
#' \preformatted{
#' x <- c(0, 7654321, 54321.1234, 321.123, 321.123456789, 54321.1234, 100000000000, 1E4, NA)
#' numeric.precision.and.scale(x)
#' numeric.precision.and.scale(c(10.0, 1.2))   # shows why the SQL.precision is different
#' }
numeric.precision.and.scale <- function(x) {

  # Remember current options
  old.scipen <- getOption("scipen")

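  # Overwrite options; on.exit() restores them on every exit path,
  # including the branch for empty input
  options(scipen = 999)   # avoid scientific notation when converting numerics to strings
  on.exit(options(scipen = old.scipen), add = TRUE)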

  # Extract the decimal point character of the computer's current locale
  decimal.sign <- substr( 1 / 2, 2, 2)

  x.string <- as.character(x[!is.na(x)])

  if (length(x.string) > 0) {
    # calculate
    precision <- max(nchar(sub(decimal.sign, "", x.string, fixed = TRUE)))
    scale <- max(nchar(sub(paste0("\\d+\\", decimal.sign, "?(.*)$"), "\\1", x.string)))
    non.fractional.precision <- max(trunc(log10(abs(x))) + 1, na.rm = TRUE)
    SQL.precision <- max(precision, non.fractional.precision + scale)

    # Reset changed options
    options(scipen = old.scipen)
  } else {
    precision <- 1
    scale <- 0
    non.fractional.precision <- 1
    SQL.precision <- 1
  }

  return(list(precision = precision,
              scale = scale,
              non.fractional.precision = non.fractional.precision,
              SQL.precision = SQL.precision))
}
R Yoda