4

My data is in this format:

country gdp digits
US      100 2657
Aus     50  123
NZ      40  11

and I'd like to take the mean, for each country of the individual digits that are all stored in the digits column.

So this is what I'm after:

country gdp digits mean_digits
US      100 2657   5
Aus     50  123    2
NZ      40  11     1

I imagine I should split the digits column into individual digits in separate columns and then take an arithmetic mean, but I was just a little unsure, because different rows have different numbers of digits in the digits field.

Code for the reproducable data below:

df <- data.frame(stringsAsFactors=FALSE,
     country = c("US", "AUS", "NZ"),
         gdp = c(100, 50, 40),
      digits = c(2657, 123, 11)
)
Jeremy K.
  • 1,710
  • 14
  • 35

6 Answers6

4

We need a function to split the number into digits and take the mean:

mean_digits = function(x) {
  sapply(strsplit(as.character(x), split = "", fixed = TRUE),
         function(x) mean(as.integer(x)))
}

df$mean_digits = mean_digits(df$digits)
df
#   country gdp digits mean_digits
# 1      US 100   2657           5
# 2     AUS  50    123           2
# 3      NZ  40     11           1

as.character() converts the numeric input to character, strsplit splits the numbers into individual digits (resulting in a list), then with sapply, to each list element we convert to integer and take the mean.

We use fixed = TRUE for a little bit of efficiency, since we don't need any special regex to split every digit apart.

If you're using this function frequently, you may want to round or check that the input is integer, it will return NA if the input has decimals due to the ..

Gregor Thomas
  • 136,190
  • 20
  • 167
  • 294
  • 1
    You explained the code so clearly. I'd be a little lost with the `sapply` without your explanation, so very much appreciated. – Jeremy K. Jun 11 '19 at 12:25
  • with the second `function(x)` within the `mean_digits` function, is there recursion going on there? I'm trying to work out what the second `function(x)` does. Thank you. – Jeremy K. Jun 11 '19 at 16:20
  • 1
    No recursion. `sapply` only applies a single function to the data. I want to apply 2 functions, first `as.integer`, then `mean`. I could make a function `my_function = function(x) mean(as.integer(x))`, but I'm only going to use it this one time inside `sapply`, so I don't want to go to the bother of naming. So instead I can just pass the unnamed definition of the function, `function(x) mean(as.integer(x))`, which is called an *anonymous function*. – Gregor Thomas Jun 11 '19 at 16:36
  • 1
    I used `x` out of habit---it doesn't matter that the `mean_digits` function also called it's argument `x`. I could have used `function(blarg) mean(as.integer(blarg))` just as well – Gregor Thomas Jun 11 '19 at 16:36
  • 1
    Another way to think of it: Let's say I have `dd = c(1, 2, 3, NA, 5)` and I want to get the average, removing the `NA`. A lame way to do it is `my_narm_pref = TRUE` and then `mean(dd, na.rm = my_narm_pref)`. That's pointless though, unless I want to re-use that preference in different functions, so I'll just do `mean(dd, na.rm = TRUE) `. This is just the same, except that instead of a simple argument like `TRUE` or `FALSE`, `sapply` takes a function as an argument. I don't want to save and reuse the function, so I pass the function directly without naming it. – Gregor Thomas Jun 11 '19 at 16:40
  • The explanation above is so helpful. Just in case anyone is looking at this question later on, here's a link to an explanation of *anonymous functions* http://adv-r.had.co.nz/Functional-programming.html#anonymous-functions – Jeremy K. Jun 11 '19 at 16:54
2

One tidyverse possibility could be:

df %>%
 mutate(digits = str_split(digits, pattern = "")) %>%
 unnest() %>%
 group_by(country, gdp) %>%
 summarise(digits = mean(as.numeric(digits)))

  country   gdp digits
  <chr>   <int>  <dbl>
1 Aus        50      2
2 NZ         40      1
3 US        100      5

Or:

df %>%
 mutate(digits = str_split(digits, pattern = "")) %>%
 unnest() %>%
 group_by(country, gdp) %>%
 summarise_all(list(~ mean(as.numeric(.))))
tmfmnk
  • 38,881
  • 4
  • 47
  • 67
2

Here is a stringr alternative. It uses sapply with str_extract_all to extract the characters of df$digits for each row and calculates the mean.

library(stringr)
df$mean_digits <- sapply(str_extract_all(df$digits, ".{1}"), function(x) mean(as.numeric(x)))

df
  country gdp digits mean_digits
1      US 100   2657           5
2     AUS  50    123           2
3      NZ  40     11           1

Or, if you really wanted to, you could do it by using the matrix output from str_extract_all and rowMeans. Note: for str_extract_all, simplify = FALSE is the default.

extracted_mat <- str_extract_all(df$digits, ".{1}", simplify = TRUE)
class(extracted_mat) <- "numeric"

df$mean_digits <- rowMeans(extracted_mat, na.rm = T)

EDIT: running benchmarks on a larger scale (i.e., using @Gregor's sample suggestion).

# Packages 
library(stringr)
library(gsubfn)

# Functions
mean_digits = function(x) {
  sapply(strsplit(as.character(x), split = "", fixed = TRUE),
         function(x) mean(as.integer(x)))
}
mnDigit <- function(x) {
  n <- nchar(x)
  sq <- as.numeric(paste0("1e", n:0))
  mean((x %% sq[-length(sq)]) %/% sq[-1])
}
mnDigit2 <- function(a) {
  dig <- ceiling(log10(a + 1))
  vec1 <- 10^(dig:1)
  vec2 <- vec1 / 10
  mean((a %% vec1) %/% vec2)
}

# Creating x
set.seed(1)
x = sample(1:1e7, size = 5e5)


microbenchmark::microbenchmark(mnDigit2=sapply(x, mnDigit2),
                               mnDigit=sapply(x, mnDigit),
                               stringr=sapply(str_extract_all(x, ".{1}"), function(x) mean(as.numeric(x))),
                               stringr_matrix = {
                                 extracted_mat <- str_extract_all(x, ".{1}", simplify = TRUE)
                                 class(extracted_mat) <- "numeric"
                                 rowMeans(extracted_mat, na.rm = T)
                               },
                               strsplit=mean_digits(x),
                               rowMeans=rowMeans(read.table(text = gsub("\\b", " ", x), fill = NA), na.rm = TRUE),
                               #strapply=sapply(strapply(x, ".", as.numeric, simplify=TRUE), mean),
                               times = 10)
Unit: milliseconds
           expr       min       lq     mean   median       uq      max neval  cld
       mnDigit2 3154.4249 3226.633 3461.847 3445.867 3612.690 3840.691    10   c 
        mnDigit 6403.7460 6613.345 6876.223 6736.304 6965.453 7634.197    10    d
        stringr 3277.0188 3628.581 3765.786 3711.022 3808.547 4347.229    10   c 
 stringr_matrix  944.5599 1029.527 1136.334 1090.186 1169.633 1540.976    10 a   
       strsplit 3087.6628 3259.925 3500.780 3416.607 3585.573 4249.027    10   c 
       rowMeans 1354.5196 1449.871 1604.305 1594.297 1745.088 1828.070    10  b 



identical(sapply(x, mnDigit2), sapply(x, mnDigit))
[1] TRUE
identical(sapply(x, mnDigit2), sapply(str_extract_all(x, ".{1}"), function(x) mean(as.numeric(x))))
[1] TRUE
identical(sapply(x, mnDigit2), {
  extracted_mat <- str_extract_all(x, ".{1}", simplify = TRUE)
  class(extracted_mat) <- "numeric"
  rowMeans(extracted_mat, na.rm = T)
})
[1] TRUE
identical(sapply(x, mnDigit2), mean_digits(x))
[1] TRUE
identical(sapply(x, mnDigit2), rowMeans(read.table(text = gsub("\\b", " ", x), fill = NA), na.rm = TRUE))
[1] TRUE
Andrew
  • 5,028
  • 2
  • 11
  • 21
2

An another tidyverse one-liner w/o other dependencies:

df %>% mutate(mean_digits =  map_dbl(strsplit(as.character(df$digits), ""), 
                                     ~ mean(as.numeric(.x))))
#   country gdp digits mean_digits
# 1      US 100   2657           5
# 2     AUS  50    123           2
# 3      NZ  40     11           1

Explanation

  1. You use strsplit to split the digits into single digits. This gives you a list where each element contains the single digits.
  2. Then you loop over this list and calculate the mean over these digits. Here we use map_dbl from purrr but a simple sapply would also do the trick.

Or a solution based on arithmetics rather than string spliiting:

df %>% mutate(mean_digits = 
                map_dbl(digits, 
                        ~ mean((.x %/% 10 ^ (0:(nchar(as.character(.x)) - 1)) %% 10))))

Explanation

You integer divide (%/%) each number by powers of 10 (i.e. 10^0, 10^1, 10^2, ..., 10^i up to the number of digits and you take this result modulo 10 (which gives you exactly the original digit). Then you calculate the mean.


Bare functions to be used for benchmarking

split_based <- function(x) {
   sapply(strsplit(as.character(x), ""), 
            function(.x) mean(as.numeric(.x)))
}

## split_based(df$digits)

arithmetic_based <- function(.x) {
   mean((.x %/% 10 ^ (0:(nchar(as.character(.x)) - 1)) %% 10))
}

## sapply(df$digits, arithmetic_based)
thothal
  • 16,690
  • 3
  • 36
  • 71
2

This might more efficiently be done with aritmetics.

Inspired from this solution we could do:

mnDigit <- function(x) {
  n <- nchar(x)
  sq <- as.numeric(paste0("1e", n:0))
  mean((x %% sq[-length(sq)]) %/% sq[-1])
}

sapply(df$digits, mnDigit)
# [1] 5 2 1

Explanation: In the function nchar first counts the digits and creates a vector of powers of 10. The final line basically counts each power of 10 in modulo.

Applying the "more general solution" mentioned in the linked answer would look like this (thx to @thothal for fixing the error):

mnDigit2 <- function(a) {
  dig <- ceiling(log10(a + 1))
  vec1 <- 10^(dig:1)
  vec2 <- vec1 / 10
  mean((a %% vec1) %/% vec2)
}

Let's take a look at the benchmark:

  Unit: milliseconds
            expr        min         lq      mean    median         uq         max neval cld
mnDigit2          140.65468  152.48952  173.7740  171.3010  179.23491   248.25977    10  a 
mnDigit           130.21340  151.76850  185.0632  166.7446  193.03661   292.59642    10  a 
stringr           112.80276  116.17671  129.7033  130.6521  137.24450   149.82282    10  a 
strsplit          106.64857  133.76875  155.3771  138.6853  148.58234   257.20670    10  a 
rowMeans           27.58122   28.55431   37.8117   29.5755   41.82507    66.96972    10  a 
strapply         6260.85467 6725.88120 7673.3511 6888.5765 8957.92438 10773.54486    10   b
split_based       363.59171  432.15120  475.5603  459.9434  528.20592   623.79144    10  a 
arithmetic_based  137.60552  172.90697  195.4316  183.1395  208.44365   292.07671    10  a

Note: I've taken out the tidyverse solutions because they are too nested with additional data frame manipulation.

However, this seems NOT to be true. In fact the rowMeans - read.table approach seems to be by far the fastest.

Data

df <- structure(list(country = c("US", "AUS", "NZ"), gdp = c(100, 50, 
40), digits = c(2657, 123, 11)), class = "data.frame", row.names = c(NA, 
-3L))

Benchmark code

set.seed(42)
evav <- sample(1:1e5, size=1e4)

library(stringr)  # for str_extract_all
library(gsubfn)  # for strapply
microbenchmark::microbenchmark(mnDigit2=sapply(evav, mnDigit2),
                               mnDigit=sapply(evav, mnDigit2),
                               stringr=sapply(str_extract_all(evav, ".{1}"), function(x) mean(as.numeric(x))),
                               strsplit=mean_digits(evav),
                               rowMeans=rowMeans(read.table(text = gsub("\\b", " ", evav), fill = NA), na.rm = TRUE),
                               strapply=sapply(strapply(evav, ".", as.numeric, simplify=TRUE), mean),
                               split_based=sapply(evav, split_based),
                               arithmetic_based=sapply(evav, arithmetic_based),
                               times=10L,
                               control=list(warmup=10L))
# see `mean_digits` `split_based` & `arithmetic_based` functions in other answers
jay.sf
  • 60,139
  • 8
  • 53
  • 110
  • 1
    Well, your function does not give the correct results: `mnDigit(1000) #0` but should be `0.25` – thothal Jun 11 '19 at 12:58
  • 1
    @thothal Interesting, rolled back to first version which works. – jay.sf Jun 11 '19 at 13:06
  • Thanks for including the benchmark. I think it is really helpful!! However, wouldn't it be wise to scale up `df$digits` for benchmarking (opposed to n=3)? And, if benchmarking, I think the `stringr`_matrix solution is probably quicker. – Andrew Jun 11 '19 at 13:13
  • 1
    @jay.sf issue was that `ceiling(log10(1000)) ` is exactly `3`. Easy solution is to use `ceiling(log10(a + 1))`. – thothal Jun 11 '19 at 13:15
  • @thothal thanks, I'll rewrite the function. And already taken account for the tidyverses. – jay.sf Jun 11 '19 at 13:17
  • @G.Grothendieck Done so, any more suggestions? – jay.sf Jun 11 '19 at 13:33
  • @G.Grothendieck Added benchmark of a fresh R session, also taken your update into account which performs a little better. – jay.sf Jun 11 '19 at 13:47
  • If I copy and paste that into a fresh session I get: Error in match.fun(FUN) : object 'mnDigit2' not found – G. Grothendieck Jun 11 '19 at 13:51
  • @G.Grothendieck I'm not sure exactly what you're after, but you can find `mnDigit2` in my answer. – jay.sf Jun 11 '19 at 13:55
  • I was after something that is complete and self contained so it can be easily reproduced. – G. Grothendieck Jun 11 '19 at 13:58
  • 1
    (a) My `mean_digits` function uses `sapply` internally, you don't need to wrap it in `sapply`, that just slows it down. (b) If you're gonna bother to run a benchmark, better to do it 10 times on input of size 500k - 1M (input big enough for timing to matter) rather than 100 times on input of length 3. I'd recommend re-running on input such as `x = sample(1:1e7, size = 5e5)` – Gregor Thomas Jun 11 '19 at 14:02
  • @Gregor Done so. However, that nobody has to wait 7 years for the benchmark, we are satisfied with a little less numbers :) The results are indeed amazingly different. Thanks all for this discussion. – jay.sf Jun 11 '19 at 15:07
  • Yes, microbenchmark is nice enough to give LQ and UQ (and min and max) to estimate the spread as well, so running 100 times is almost never needed---even 5 times is usually plenty to tell any difference that matters. I do question your `rowMeans` numbers, Andrew beat you to an updated benchmark and their `rowMeans` placed in a respectable 2nd, not a runaway first. – Gregor Thomas Jun 11 '19 at 15:14
  • @Gregor I can replicate the 1st place though (`all.equal` yields `TRUE`), probably machine dependent? However there may be slightly different results on a "clean" machine running nothing but OS and R. – jay.sf Jun 11 '19 at 15:22
2

1) strapply This one-liner uses strapply in gsubfn. It converts each digit to numeric and then takes the mean of each.

library(gsubfn)

transform(df, mean = sapply(strapply(digits, ".", as.numeric, simplify = TRUE), mean))

2) This is a little longer but still one statement and uses no packages. It inserts a space between digits, reads them using read.table and then applies rowMeans.

transform(df, 
  mean = rowMeans(read.table(text = gsub("\\b", " ", digits), fill = NA), na.rm = TRUE))
G. Grothendieck
  • 254,981
  • 17
  • 203
  • 341