How to set up a fast custom German-English dictionary

Question

As my input is frequently in German but I want the code to be pure English, I would like to have a short custom dictionary, consisting basically of weekday and month abbreviations. Thus, I want to create a fast English-German (and vice versa) dictionary, ideally as an environment with parent environment = .GlobalEnv. But when I put the code in a function, the dict_g2e dictionary is no longer known.

 set_dict <- function() { # Delete this line and ...
   dict_g2e <- new.env(hash = TRUE, size = 7)
   from <- c("So", "Mo", "Di", "Mi", "Do", "Fr", "Sa")
   to <- c("Sun", "Mon", "Tues", "Wed", "Thurs", "Fri", "Sat")
   for (i in seq_along(from)) {
     assign(x = from[i], value = to[i], envir = dict_g2e)
   } # this line and the code is working as expected

Test:

> get("So", env = dict_g2e) # ran without the set_dict <- function() {...} part
[1] "Sun"

Where is the bug?
I would do the same with dict_e2g. Is there a faster & shorter way to do this?
Is there a better command than get("So", env = dict_g2e)? Is there any argument against g2e <- function(wd) {get(wd, envir = dict_g2e)}?

Edit after comments from @Roland and @alexis_laz:

df_dict <- function() {
  df <- data.frame(german = c("So", "Mo", "Di", "Mi", "Do", "Fr", "Sa"),
    english = c("Sun", "Mon", "Tues", "Wed", "Thurs", "Fri", "Sat"),
    stringsAsFactors = FALSE)
  return(df)
}
df <- df_dict()

df_g2e <- function(wd) {
  df$english[which(df$german == wd)]
}

The microbenchmark:

print(summary(microbenchmark::microbenchmark(
  g2e("So"),
  df_g2e("So"),
  times = 1000L, unit = "us")))

And the result:

       expr    min     lq      mean median     uq    max neval
   g2e("So")  1.520  2.280  2.434178  2.281  2.661 17.106  1000
df_g2e("So") 12.545 15.205 16.368450 15.966 16.726 55.500  1000

did you have a look at this ? http://stackoverflow.com/questions/16347731/how-to-change-the-locale-of-r-in-rstudio — user5249203, Jun 20 '16 at 12:43
@user5249203: My question is not about preset translations of error messages and so on. So I think that is not what I am looking for (unless there is something between the lines which I didn't get). — Christoph, Jun 20 '16 at 12:47
I don't get this. Why don't you simply use a named vector? `wdays <- setNames(c("Sun", "Mon", "Tues", "Wed", "Thurs", "Fri", "Sat"), c("So", "Mo", "Di", "Mi", "Do", "Fr", "Sa")); wdays["Mi"]` — Roland, Jun 20 '16 at 12:57
(You're missing a closing `}` in `set_dict`) `set_dict` does not return `dict_g2e` (it implicitly returns `NULL` as the last evaluation is a `for` loop that returns `NULL` -- `help("for")`); you need to `return(dict_g2e)` and also save it to a variable: `dict_g2e = set_dict()`. If you want to follow a -not so usual in R- path of side effects / assigning to global environment from within a function, you -just- need to use `dict_g2e <<- new.env(hash = TRUE, size = 7)`. Though for such cases the most usual approach would be to use a "data.frame" with 'from' and 'to' and use `to[match(x, from)]` — alexis_laz, Jun 20 '16 at 13:01
@Roland: The vector approach is much slower and I thought I could use a simple and elegant hash-approach using environments. — Christoph, Jun 20 '16 at 13:01
@alexis_laz: You are right, that solved the problem. I still don't understand why you would use `data.frame`. See my edits above. (You can answer at least that part of the question;-) — Christoph, Jun 20 '16 at 13:24
@Christoph: In your edit you don't need the `english[which(german == x)]` -- either `english[match(x, german)]` or, as Roland notes, subsetting a named vector by "character" would do. Also note that `get` will return an error if the key is not found, although you could replace it with `mget`, which also handles more than one query. Aside from that, I find setting up a dictionary with a named vector or 2 vectors ("data.frame") more straightforward. I see, though, that, depending on the use case, multiple accesses to an `environment` can be more efficient. — alexis_laz, Jun 20 '16 at 13:39
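
A minimal sketch of what the comments above describe, assuming the only goal is to get the environment out of the function: `set_dict` returns the environment and the caller stores it. The `mget` and `match` lines illustrate the lookups alexis_laz mentions; the key "Xx" and the variable names `single`, `several` and `matched` are made up for illustration.

```r
# Build the dictionary inside a function and return it explicitly.
set_dict <- function() {
  from <- c("So", "Mo", "Di", "Mi", "Do", "Fr", "Sa")
  to   <- c("Sun", "Mon", "Tues", "Wed", "Thurs", "Fri", "Sat")
  dict <- new.env(hash = TRUE, size = length(from))
  for (i in seq_along(from)) {
    assign(x = from[i], value = to[i], envir = dict)
  }
  dict  # without this line the function would return the value of the for loop (NULL)
}

dict_g2e <- set_dict()  # keep the returned environment in a variable

# Single lookup; get() raises an error if the key does not exist.
single <- get("So", envir = dict_g2e)

# mget() accepts several keys and a default for missing ones; it returns a list.
several <- unlist(mget(c("So", "Xx"), envir = dict_g2e,
                       ifnotfound = NA_character_))

# The match()-based lookup on two parallel vectors, for comparison.
from <- c("So", "Mo", "Di", "Mi", "Do", "Fr", "Sa")
to   <- c("Sun", "Mon", "Tues", "Wed", "Thurs", "Fri", "Sat")
matched <- to[match(c("Mi", "Xx"), from)]
```

Here `single` is "Sun", `several` is a named character vector with NA for the unknown key, and `matched` is "Wed" followed by NA.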

Roland · Accepted Answer · 2016-06-20T13:37:29.357

You could use a closure:

dict <- function() {
  dict_g2e <- new.env(hash = TRUE, size = 7)
  from <- c("So", "Mo", "Di", "Mi", "Do", "Fr", "Sa")
  to <- c("Sun", "Mon", "Tues", "Wed", "Thurs", "Fri", "Sat")
  for (i in seq_along(from)) {
    assign(x = from[i], value = to[i], envir = dict_g2e)
  }
  # Return a lookup function that closes over dict_g2e
  function(from) {
    dict_g2e[[from]]
  }
}

wdays1 <- dict()
wdays1("So")
#[1] "Sun"

However, vector subsetting is faster:

wdays2 <- setNames(c("Sun", "Mon", "Tues", "Wed", "Thurs", "Fri", "Sat"), 
                   c("So", "Mo", "Di", "Mi", "Do", "Fr", "Sa"))

And defining the environment in the global environment is faster still:

wdays3 <- list2env(as.list(wdays2), hash = TRUE)

library(microbenchmark)
microbenchmark(for (i in seq_len(1e3)) wdays1("Mi"), 
               for (i in seq_len(1e3)) wdays2[["Mi"]], 
               for (i in seq_len(1e3)) wdays3[["Mi"]])

#Unit: microseconds
#                                    expr     min      lq     mean   median       uq      max neval cld
#   for (i in seq_len(1000)) wdays1("Mi") 434.045 488.205 520.6626 507.0265 516.2455 2397.108   100   c
# for (i in seq_len(1000)) wdays2[["Mi"]] 182.324 211.005 214.6720 215.9985 217.9190  239.173   100  b 
# for (i in seq_len(1000)) wdays3[["Mi"]] 141.609 164.143 167.1088 168.2410 169.7770  190.007   100 a

However, there is a clear advantage to the vector approach: It is vectorized.

wdays2[c("So", "Do")]
#     So      Do 
#  "Sun" "Thurs"

If you want to translate in both directions, using a data.frame would be the natural approach, but data.frame subsetting is rather slow. You could use two named vectors instead, one for each direction.
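
The two-vector idea could look roughly like this. It is only a sketch: the names `g2e`, `e2g` and the helper `translate` are chosen here for illustration and are not part of any package.

```r
german  <- c("So", "Mo", "Di", "Mi", "Do", "Fr", "Sa")
english <- c("Sun", "Mon", "Tues", "Wed", "Thurs", "Fri", "Sat")

# One named vector per direction: names are the keys, values the translations.
g2e <- setNames(english, german)
e2g <- setNames(german, english)

# Hypothetical helper: vectorized lookup that drops the names from the result.
# Unknown keys come back as NA instead of raising an error.
translate <- function(x, dict) unname(dict[x])

translate(c("So", "Do"), g2e)
#[1] "Sun"   "Thurs"
translate("Thurs", e2g)
#[1] "Do"
translate("Xx", g2e)
#[1] NA
```

Building both vectors from the same pair of source vectors keeps the two directions in sync if an entry is added later.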

I didn't know `setNames`. Furthermore, the difference between vectors and environments is now very clear - sometimes you don't get the obvious;-) — Christoph, Jun 20 '16 at 13:49
