
I'm having an issue with computation time because I run a triple for loop in R to create anomaly thresholds at the day-of-week and hour level for each unique ID.

My original data frame has these columns: Unique ID, Event Date Hour, Event Date, Event Day of Week, Event Hour, Numeric Variable 1, Numeric Variable 2, etc.

library(MASS)        # ginv()
library(FactoMineR)  # PCA()

df <- read.csv("mm.csv", header = TRUE, sep = ",")

for (i in unique(df$customer_id)) {
  # Initialize the output data frame so I can rbind as I loop through the grains.
  # This data frame is emptied out once we move on to the next customer_id.
  output.df <- data.frame(seller_name = factor(), is_anomaly_date = integer(), event_date_hr = double(), event_day_of_wk = integer(), event_day = double(), ...)

  for (k in unique(df$event_day_of_wk)) {
    for (z in unique(df$event_hr)) {
      # Columns 10:19 are the 9 different numeric variables I am creating anomaly thresholds for
      merchant.df <- df[df$customer_id == i & df$event_day_of_wk == k & df$event_hr == z, 10:19]

      # 1st anomaly threshold - I have multiple different anomaly thresholds

      # TRANSFORM VARIABLES - sometimes within the loop I run another loop that transforms this subset of the data.
      for (j in names(merchant.df)) {
        merchant.df[[paste0(j, "_log")]] <- log(merchant.df[[j]] + 1)
        #merchant.df[[paste0(j, "_scale")]] <- scale(merchant.df[[j]])
        #merchant.df[[paste0(j, "_cube")]] <- merchant.df[[j]]^3
        #merchant.df[[paste0(j, "_cos")]] <- cos(merchant.df[[j]])
      }

      # Fit a multivariate Gaussian to this subset and flag the lowest-density 1% of rows
      mu_vector        = apply(merchant.df, 2, mean)
      sigma_matrix     = cov(merchant.df, use = "complete.obs", method = "pearson")
      inv_sigma_matrix = ginv(sigma_matrix)
      det_sigma_matrix = det(sigma_matrix)

      z_probas    = apply(merchant.df, 1, mv_gaussian, mu_vector, det_sigma_matrix, inv_sigma_matrix)
      eps         = quantile(z_probas, 0.01)
      mv_outliers = z_probas < eps

      # 2nd anomaly threshold: PCA-based anomaly score compared against a chi-square cutoff
      nov        = ncol(merchant.df)
      pca_result = PCA(merchant.df, graph = FALSE, ncp = nov, scale.unit = TRUE)
      pca.var    = pca_result$eig[, "cumulative percentage of variance"] / 100
      lambda     = pca_result$eig[, "eigenvalue"]
      anomaly_score = (as.matrix(pca_result$ind$coord)^2) %*% matrix(1 / lambda, ncol = 1)
      significance  = 0.99
      thresh        = qchisq(significance, nov)
      pca_outliers  = as.vector(anomaly_score > thresh)

      # Bind the anomaly flags onto the subset, then row-bind onto the final output data frame
      # before looping to the next hour and then day of week. temp.output.df is remade every
      # iteration while output.df slowly grows.
      temp.output.df <- cbind(merchant.df, mv_outliers, pca_outliers)
      output.df <- rbind(output.df, temp.output.df)
    }
  }
  # Write the output for this particular unique ID; output.df is then recreated at the
  # top of the loop for the next unique ID. (File name is illustrative.)
  write.csv(output.df, file = paste0(i, ".csv"), row.names = FALSE)
}
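
(`mv_gaussian` isn't defined in the snippet above; it's essentially the multivariate normal density evaluated row by row. A minimal sketch of what it might look like:)

# Sketch of the mv_gaussian helper (not shown above): the multivariate normal density
# for one row x, given the mean vector, det(Sigma) and the (pseudo-)inverse of Sigma.
mv_gaussian <- function(x, mu_vector, det_sigma_matrix, inv_sigma_matrix) {
  k <- length(mu_vector)
  d <- as.numeric(x - mu_vector)
  as.numeric(exp(-0.5 * t(d) %*% inv_sigma_matrix %*% d) /
               ((2 * pi)^(k / 2) * sqrt(det_sigma_matrix)))
}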

The code above shows the idea of what I'm doing. As you can see, I run 3 nested for loops where I calculate multiple anomaly detections at the lowest grain, which is the hour level by day of week; once I finish a customer, I output its results to a CSV at the unique customer_id level.

Each individual computation runs very fast; however, the triple for loop is killing my overall performance. Does anyone know any other way I can do an operation like this, given my original data frame and the need to output a CSV at every unique_id level?

Wayne Lee
  • Without seeing your actual code it's hard to know how to optimize it. In particular, none of `i`, `k` and `z` are used inside your loop, as written. Please provide a [mcve]. See also http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example . –  Mar 28 '17 at 17:40
  • Thanks for the quick reply, I forgot the most important line: `merchant.df = df[df$merchant_customer_id==i & df$event_day_of_wk==k & df$event_hr==z,10:19]` (columns 10:19 are the 9 different numeric variables I am creating anomaly thresholds for). I use i, k, and z to subset the main df to obtain data for that particular unique ID, day of week, and hour. Then at that level I do all my calculations. – Wayne Lee Mar 28 '17 at 18:07
  • I guess something like groupBy (aggregate) should help – dk14 Mar 28 '17 at 18:08
  • So don't use a triple-loop. Use `dplyr::group_by(customer_id, event_day_of_wk, event_hr)`, or the data.table equivalent. (How many rows is `merchant.df`? One? >1? Varies?) – smci Mar 28 '17 at 18:10
  • @WayneLee Please put additional information in your question, i.e. edit your question: http://stackoverflow.com/posts/43076157/edit – jogo Mar 28 '17 at 18:10
  • Thank you for the quick response. Would dplyr::group_by be able to handle the additional calculations I am doing within my for loop? Also, how would it write a CSV file for each unique customer_id? The original df is around 181k rows, while merchant.df ends up being only 52 rows because it looks at a unique_id, then a day of the week, and then a specific hour. In the end it does vary because not every unique_id has the same length of history. @jogo thank you for your comment, I've added more details to the code. Please let me know if there is anything else I should add. – Wayne Lee Mar 28 '17 at 18:21
  • Yes, the other dplyr verbs can do what your code currently does. I sketched you a pseudocode answer below. If you post a new question with reproducible data *("How do I rewrite this triple-loop with dplyr?")* I could do better. The main point is that doing `cbind` and `rbind` on every iteration to append to your output will kill your performance. Also, you don't need to `cbind()` your entire input df into your output df; you could `join()` the input and output dfs later on `customer_id, event_day_of_wk, event_hr`. – smci Mar 29 '17 at 19:59
  • EDIT: since you want to collate intermediate results for each `customer_id` then `write.csv()` those results, that needs to go in a separate outer level of grouping. See my answer. You'll probably find it easier to write your code by just doing `%>% filter(customer_id=)` – smci Mar 29 '17 at 20:23

1 Answer

  • So don't use a triple-loop. Use dplyr::group_by(customer_id, event_day_of_wk, event_hr), or the data.table equivalent. Both should be faster.
  • There's no need to explicitly append with rbind and cbind on every iteration; that is what kills your performance.
  • Also, no need to cbind() your entire input df into your output df; your only actual outputs are mv_outliers, pca_outliers; you could join() the input and output dfs later on customer_id, event_day_of_wk, event_hr
  • EDIT: since you want to collate all results for each customer_id then write.csv() them, that needs to go in an outer level of grouping, and group_by(event_day_of_wk, event_hr) in the inner level.


# Here is pseudocode, you can figure out the rest, do things incrementally
# It looks like seller_name, is_anomaly_date, event_date_hr, event_day_of_wk, event_day,... are variables from your input

require(dplyr)

output.df <- df %>%
  group_by(customer_id) %>%
    group_by(event_day_of_wk, event_hr, add = TRUE) %>%   # add = TRUE keeps the customer_id grouping

    # columns 10:19 ('foo','bar','baz'...) are the 9 different numeric variables I am creating anomaly thresholds
    # Either a) you can hardcode their names in mutate(), summarize() calls
    #  or b) you can reference the vars by string in mutate_(), summarize_() calls

    # TRANSFORM VARIABLES
    mutate(foo_log = log1p(foo), bar_log = log1p(bar), ...) %>%

    mutate(mu_vector = c(mean(foo_log), mean(bar_log)...) ) %>%
    # compute sigma_matrix, inv_sigma_matrix, det_sigma_matrix ...

    summarize(
       z_probas=mv_gaussian(mu_vector, det_sigma_matrix, inv_sigma_matrix),
       eps = quantile(z_probas,0.01),
       mv_outliers = (z_probas<eps)
    ) %>%

    # similarly, use mutate() and do.call() for your PCA invocation...

    # Your outputs are mv_outliers, pca_outliers
    # You don't necessarily need to `cbind(merchant.df, mv_outliers, pca_outliers)` i.e. cbind all your input data together with your output

    # Now remove all your temporary variables from your output:
    select(-foo_log, -bar_log, ...) %>%
    # or else just select(mv_outliers, pca_outliers) the variables you want to keep

  ungroup() %>%  # (this ends the group_by(event_day_of_wk, event_hr) level and row-binds all the intermediate dataframes for you)

  write.csv( c(.$mv_outliers, .$pca_outliers), file='<this_customer_id>.csv')

ungroup()  # group_by(customer_id)
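
If it helps, here is a more concrete sketch of the same idea that actually runs: the outer per-customer level is a plain loop over split(), and the inner day-of-week/hour level uses group_by() + do(). It assumes MASS::ginv, FactoMineR::PCA and the mv_gaussian helper from the question; flag_anomalies and the output file names are just illustrative, and for brevity it applies both thresholds to the log-transformed variables only.

library(dplyr)
library(MASS)        # ginv()
library(FactoMineR)  # PCA()

# Illustrative helper: takes one day-of-week/hour subset for one customer and
# returns it with the two anomaly flags appended. Columns 10:19 are the numeric
# variables, as in the question.
flag_anomalies <- function(sub) {
  X  <- log(sub[, 10:19] + 1)                 # log transform, as in the original loop
  mu <- colMeans(X)
  S  <- cov(X, use = "complete.obs")
  z  <- apply(X, 1, mv_gaussian, mu, det(S), ginv(S))
  sub$mv_outliers <- z < quantile(z, 0.01)

  p      <- PCA(X, graph = FALSE, ncp = ncol(X), scale.unit = TRUE)
  lambda <- p$eig[, "eigenvalue"]
  score  <- (as.matrix(p$ind$coord)^2) %*% matrix(1 / lambda, ncol = 1)
  sub$pca_outliers <- as.vector(score > qchisq(0.99, ncol(X)))
  sub
}

# One CSV per customer_id; the grouped do() replaces the day-of-week/hour loops
# and the repeated rbind. Keeping all original columns avoids a later join.
for (cust in split(df, df$customer_id)) {
  out <- cust %>%
    group_by(event_day_of_wk, event_hr) %>%
    do(flag_anomalies(.)) %>%
    ungroup()
  write.csv(out, file = paste0(cust$customer_id[1], ".csv"), row.names = FALSE)
}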

See also "write.csv() in dplyr chain"
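
(That question covers the idiom of writing a file per group inside a dplyr chain via do(); roughly, with an illustrative file name:)

# Writes one CSV per customer_id as a side effect; the named do() argument just
# collects the (NULL) return values of write.csv() in a list-column.
df %>%
  group_by(customer_id) %>%
  do(written = write.csv(., file = paste0(.$customer_id[1], "_anomalies.csv"),
                         row.names = FALSE))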

smci
  • Would dplyr::group_by be able to handle the additional calculations I am doing within my for loop? For example, I do variable transformation and other functions. – Wayne Lee Mar 29 '17 at 05:00
  • @WayneLee: yes `dplyr` can handle everything in your code above. See the doc for `dplyr::mutate(), summarize()`. (So can `data.table`) – smci Mar 29 '17 at 19:18