
I am new to R, but I would like to understand and produce fast code with TensorFlow in RStudio. I understand the concept of parallelization, but I am having some problems understanding the differences among these concepts: parallelization, vectorization and tensorizing (sorry for my English). I would like some trivial examples to understand these differences. Can I apply all these concepts simultaneously?

vrige
    Vectorized: `vec <- (1:10)^2`. Not vectorized: `vec <- 1:10; for (i in seq_along(vec)) vec[i] <- vec[i]^2;`. Not vectorized: `vec <- sapply(1:10, function(x) x^2)`. – r2evans May 01 '20 at 17:57

1 Answer


Here are my two cents on parallelization and vectorization in R. I will not address tensorization as I do not have much experience with TensorFlow. However, having a background in differential geometry, my best guess is that it means using tensors, that is, higher-dimensional (data) structures, to tackle certain problems.
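
Just to give a rough picture of what such a structure looks like (this is plain base R's array type, used only for illustration, not a TensorFlow tensor), a three-dimensional array can be built like this:

# A 2 x 3 x 4 array: the closest base-R analogue of a rank-3 tensor
a <- array(1:24, dim = c(2, 3, 4))
dim(a)
# [1] 2 3 4
a[1, 2, 3]  # element at position (1, 2, 3); arrays are filled column-major
# [1] 15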

Parallelization

The basic idea of parallelization is running tasks simultaneously. Often, especially when implemented in R, this is handled via multiprocessing: typically the tasks are distributed across the computer's CPU cores (or threads; see multi-threading or check this great SO answer). In addition, parallelization can be regarded as one way to tackle concurrency; the latter has other implementations as well, such as asynchronous programming.

The typical example for parallelization (and also for concurrency) is the following: assume you have a list of URLs url1, url2, ... and you need to send a GET request (and wait for the response) to each one of them. The classical (synchronous) way is to iterate through all URLs, make the GET request, wait for the response, and only then proceed with the next URL.

# Dummy example list
urls <- rep('http://example.com', 7)
# Fetching the data
results <- rep(list(NA), length(urls))
for (k in seq_along(urls))
  results[[k]] <- httr::GET(urls[k])

The reason why this is a classical example is that these requests are (usually) independent of each other: theoretically, we do not have to wait for the first response before making the second request. So we could send those requests simultaneously:

# Parallel
urls <- rep('http://example.com', 7)
num_cores <- parallel::detectCores() - 1
cl <- parallel::makeCluster(num_cores)
parallel::clusterEvalQ(cl, library(httr))
parallel::clusterExport(cl, varlist = c('urls'))
results <- parallel::parLapply(cl, urls, httr::GET)
parallel::stopCluster(cl)

In the above code, most lines are about setting things up, but the crucial line is the second to last one: this is where we are distributing and executing the tasks across the different cores (CPUs) available.
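
As a side note, on Unix-like systems a fork-based approach such as parallel::mclapply achieves a similar effect without the explicit cluster setup (it does not fork on Windows, where mc.cores must be 1):

# Fork-based alternative, reusing urls and num_cores from above
results <- parallel::mclapply(urls, httr::GET, mc.cores = num_cores)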

In essence, parallelization is closely tied to tasks and time.

Vectorization

This topic is much more straightforward. The language R is inherently optimized for vectorized operations: vectors, matrices and arrays are built-ins in R, which is not the case for every language.

In addition, operations and functions are vectorized as well: for example, R supports division of vectors, 1:5 / 11:15, and mostly behaves as one might expect (the typical pitfalls, such as the silent recycling in 1:5 + 11:20, are well documented). Python, for example, has lists as built-ins but does not (inherently) support vectorization: something like range(5) / range(11, 15) will throw an error (yes, there are libraries such as NumPy which make this feasible).
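
To make this concrete, here is roughly what those two expressions give (the second one recycles the shorter vector silently, because its length divides the longer one's):

1:5 / 11:15
# [1] 0.09090909 0.16666667 0.23076923 0.28571429 0.33333333
1:5 + 11:20  # 1:5 is recycled twice, no warning since 10 is a multiple of 5
# [1] 12 14 16 18 20 17 19 21 23 25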

This is not black magic though: when doing paste0("url_", 1:5), the actual loop runs in a lower-level language (C), which makes it orders of magnitude faster than looping in R. This is also why loops tend to have a bad reputation in R (even though proper looping is absolutely fine). Here is a very naive illustration:

microbenchmark::microbenchmark(
  loop = {
    v1 <- 1:5; v2 <- 6:10
    result <- rep(NA, length(v1))
    for (k in seq_along(v1))
      result[k] <- v1[k] + v2[k]
  },
  vectorization = {
    result <- 1:5 + 6:10
  }
)
# Unit: nanoseconds
#          expr     min      lq       mean  median      uq     max neval cld
#          loop 1367900 1377052 1431076.98 1396951 1407551 4317901   100   b
# vectorization     400     501    1145.95    1500    1601    4001   100  a 

The bottom line here is that vectorization is around (mostly) every corner in R, and it is, in my humble opinion, the essence of R's elegance and its most beautiful feature (custom operators are pretty neat as well). For example, one can write the Taylor expansion of sin as a one-liner:

f <- function(x, n = 10) sum((-1)^(0:n) * x^(2*(0:n) + 1) / factorial((2*(0:n) + 1)))
f(0)
# 0
f(pi)
# 1.034819e-11
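
A small caveat on top of this example: f itself is not vectorized over x (the sum() collapses everything into a single number), so to evaluate it on a whole vector of inputs one can wrap it, for instance with sapply() or Vectorize():

x <- c(0, pi / 6, pi / 2, pi)
sapply(x, f)     # apply f element-wise
Vectorize(f)(x)  # same result via a vectorizing wrapper
sin(x)           # the built-in, fully vectorized reference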

Finally, beyond aesthetics, R's vectorization is a philosophy, a way of approaching problems: given a specific task, when coding in R you will always try to come up with a vectorized solution, whereas in other languages you would just stack loops on top of loops.
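
To illustrate that mindset with a made-up toy task: summing the rows of a matrix could be written with two nested loops, but the vectorized reflex in R is to reach for a function that operates on the whole object at once:

m <- matrix(1:12, nrow = 3)

# Loop-on-top-of-loop style
sums <- numeric(nrow(m))
for (i in seq_len(nrow(m)))
  for (j in seq_len(ncol(m)))
    sums[i] <- sums[i] + m[i, j]
sums
# [1] 22 26 30

# Vectorized style
rowSums(m)
# [1] 22 26 30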

niko
  • Thank you, it's an awesome answer. Now I get these two concepts: vectorization and parallelization. But I still don't understand the purpose of tensorizing. – vrige May 02 '20 at 09:20