Currently I'm having an issue with computation time, because I run a triple for loop in R to create anomaly thresholds at the day-of-week and hour level for each unique ID.
My original data frame: Unique ID, Event Date Hour, Event Date, Event Day of Week, Event Hour, Numeric Variable 1, Numeric Variable 2, etc.
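To make that shape concrete, here is a toy version of the frame (values invented; var_1 through var_9 stand in for my real numeric variables, which sit in columns 10:19 of the actual data):

toy <- data.frame(
  merchant_customer_id = 1001,
  event_date_hr   = seq(as.POSIXct("2019-01-07 00:00:00"), by = "hour", length.out = 24),
  event_date      = as.Date("2019-01-07"),
  event_day_of_wk = "Monday",
  event_hr        = 0:23,
  matrix(runif(24 * 9), ncol = 9, dimnames = list(NULL, paste0("var_", 1:9)))
)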
library(MASS)        # ginv()
library(FactoMineR)  # PCA()

df <- read.csv("mm.csv", header = TRUE, sep = ",")
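# mv_gaussian() is my own helper, called further down; it is sketched here so
# the example is self-contained. This assumes it returns the multivariate
# normal density for one row (my real version may differ in detail):
mv_gaussian <- function(x, mu, det_sigma, inv_sigma) {
  k <- length(x)
  d <- as.numeric(x - mu)
  q <- as.numeric(t(d) %*% inv_sigma %*% d)  # squared Mahalanobis distance
  exp(-0.5 * q) / sqrt((2 * pi)^k * det_sigma)
}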
for (i in unique(df$merchant_customer_id)) {
  # Initialize the output data frame so I can rbind as I loop through the
  # grains; it is emptied out once we move on to the next merchant_customer_id
  output.df <- data.frame(seller_name = factor(), is_anomaly_date = integer(),
                          event_date_hr = double(), event_day_of_wk = integer(),
                          event_day = double())  # ..., remaining columns elided
  for (k in unique(df$event_day_of_wk)) {
    for (z in unique(df$event_hr)) {
      # Columns 10:19 hold the numeric variables I build anomaly thresholds on
      merchant.df <- df[df$merchant_customer_id == i &
                        df$event_day_of_wk == k &
                        df$event_hr == z, 10:19]
      # 1st anomaly threshold - I have multiple different anomaly thresholds
      # TRANSFORM VARIABLES - sometimes within the for loop I run another loop
      # that transforms the subset of data
      for (j in names(merchant.df)) {
        merchant.df[[paste0(j, "_log")]] <- log(merchant.df[[j]] + 1)
        # merchant.df[[paste0(j, "_scale")]] <- scale(merchant.df[[j]])
        # merchant.df[[paste0(j, "_cube")]]  <- merchant.df[[j]]^3
        # merchant.df[[paste0(j, "_cos")]]   <- cos(merchant.df[[j]])
      }
      mu_vector        <- apply(merchant.df, 2, mean)
      sigma_matrix     <- cov(merchant.df, use = "complete.obs", method = "pearson")
      inv_sigma_matrix <- ginv(sigma_matrix)
      det_sigma_matrix <- det(sigma_matrix)
      z_probas         <- apply(merchant.df, 1, mv_gaussian,
                                mu_vector, det_sigma_matrix, inv_sigma_matrix)
      eps         <- quantile(z_probas, 0.01)  # flag the lowest-density 1%
      mv_outliers <- z_probas < eps
      # 2nd anomaly threshold: PCA score against a chi-square cutoff
      nov        <- ncol(merchant.df)
      pca_result <- PCA(merchant.df, graph = FALSE, ncp = nov, scale.unit = TRUE)
      pca.var    <- pca_result$eig[, "cumulative percentage of variance"] / 100
      lambda     <- pca_result$eig[, "eigenvalue"]
      # Sum over components of coord^2 / eigenvalue; large scores are anomalous
      anomaly_score <- (as.matrix(pca_result$ind$coord)^2) %*% matrix(1 / lambda, ncol = 1)
      significance  <- 0.99
      thresh        <- qchisq(significance, nov)
      pca_outliers  <- anomaly_score > thresh
      # Bind the anomaly flags onto the subset, then row-bind into the final
      # output; temp.output.df is remade on every pass while output.df keeps
      # growing, and the loop moves on to the next hour and day of the week
      temp.output.df <- cbind(merchant.df, mv_outliers, pca_outliers)
      output.df      <- rbind(output.df, temp.output.df)
    }
  }
  # Write the output for this merchant_customer_id; output.df is recreated at
  # the top of the outer loop for the next ID. (The file name is illustrative;
  # the original call omitted it.)
  write.csv(output.df, file = paste0("anomalies_", i, ".csv"), row.names = FALSE)
}
The code above shows the idea of what I'm doing. As you can see, I run three nested for loops that calculate multiple anomaly detections at the lowest grain, which is the hour level by day of the week, and once I finish I write the output for each unique merchant_customer_id to a CSV. Each individual calculation runs fast; it's the triple for loop that is killing my performance. Does anyone know another way I can do an operation like this, given my original data frame and the need to output a CSV at every unique ID level?
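For reference, this is the general shape I have been considering as a replacement: split each ID's rows by day-of-week/hour grain once, instead of re-filtering the whole frame inside nested loops. This is only a sketch; anomaly_for_group() is a hypothetical wrapper around the per-grain logic above.

for (i in unique(df$merchant_customer_id)) {
  cust.df <- df[df$merchant_customer_id == i, ]
  # One data frame per day-of-week/hour combination that actually occurs
  grains <- split(cust.df, list(cust.df$event_day_of_wk, cust.df$event_hr), drop = TRUE)
  output.df <- do.call(rbind, lapply(grains, anomaly_for_group))
  write.csv(output.df, file = paste0("anomalies_", i, ".csv"), row.names = FALSE)
}

I don't know whether this buys much over the explicit loops, or whether something like data.table's by-group operations would be the better route.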