
I'm looking for a fast way to remove all dominated rows from a table (preferably using parallel processing, to take advantage of multiple cores).

By "dominated row", I mean a row that is less than or equal to another row in all columns. For example, in the following table:

tribble(~a, ~b, ~c,
        10,  5,  3,
        10,  4,  2,
         1,  4,  1,
         7,  3,  6)

Rows 2 and 3 are dominated rows (in this case they are both dominated by row 1), and should be removed. Rows 1 and 4 are not dominated by any other row and should be preserved, resulting in this table:

tribble(~a, ~b, ~c,
        10,  5,  3,
         7,  3,  6)
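
To put the definition in code form, here is a minimal serial sketch of the dominance test (the helper name dominates is just for illustration; a row that is exactly equal does not count as dominating):

library(tibble)

# TRUE if row y dominates row x: y is >= x in every column and the rows differ
dominates = function(y, x) all(y >= x) && any(y > x)

m = as.matrix(tribble(~a, ~b, ~c,
                      10,  5,  3,
                      10,  4,  2,
                       1,  4,  1,
                       7,  3,  6))
dominates(m[1,], m[2,])  # TRUE:  row 2 is dominated by row 1
dominates(m[1,], m[4,])  # FALSE: row 4 is not dominated by row 1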

To further illustrate, here is the kind of code I'm looking to speed up:

library(dplyr)
library(tibble)

table1 = as_tibble(replicate(3, runif(500000)))
colnames(table1) = c("a", "b", "c")
table2 = table1
for (i in 1:nrow(table1)) {
  # keep rows that beat row i in at least one column, or are identical to it
  table2 = filter(table2,
    (a > table1[i,]$a | b > table1[i,]$b | c > table1[i,]$c) |
    (a == table1[i,]$a & b == table1[i,]$b & c == table1[i,]$c) )
}
filtered_table = table2

I have some ideas, but figured I'd ask if there might be well-known packages/functions that do this.


UPDATE: Here is a fairly simple parallelization of the above code, which nevertheless provides a solid performance boost:

library(doParallel)
library(dplyr)

remove_dominated = function(table) {
  ncores = detectCores()
  registerDoParallel(makeCluster(ncores))
  # Divide the table into parts and remove dominated rows from each part
  tfref = foreach(part=splitIndices(nrow(table), ncores), .combine=rbind,
                  .packages="dplyr") %dopar% {
    tpref = table[part[[1]]:part[[length(part)]],]
    tp = tpref
    for (i in 1:nrow(tpref)) {
      tp = filter(tp,
                (a > tpref[i,]$a | b > tpref[i,]$b | c > tpref[i,]$c) |
                (a == tpref[i,]$a & b == tpref[i,]$b & c == tpref[i,]$c) )
    }
    tp
  }
  # After the simplified parts have been concatenated, run a final pass to
  # remove dominated rows from the full table
  t = tfref
  for (i in 1:nrow(tfref)) {
    t = filter(t,
            (a > tfref[i,]$a | b > tfref[i,]$b | c > tfref[i,]$c) |
            (a == tfref[i,]$a & b == tfref[i,]$b & c == tfref[i,]$c) )
  }
  return(t)
}
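
For reference, calling it on the random table1 from above (just a usage sketch; note that the cluster created inside the function is never stopped explicitly):

filtered_table = remove_dominated(table1)
nrow(filtered_table)  # number of non-dominated rows that remain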
kartik_subbarao
    Please define "dominated row" in words - is it a row that is strictly less than another row in all columns? – Gregor Thomas Jun 19 '18 at 21:11
    By "dominated row", I mean a row that is less than or equal to another row in all columns. For example, if row 1 has columns A=10 B=5 C=3, row 2 has A=10 B=4 C=2, and row 3 has A=1 B=4 C=1, rows 2 and 3 are dominated by row 1. I want rows 2 and 3 to be removed. – kartik_subbarao Jun 19 '18 at 21:15
  • There are no duplicate rows in my particular data set, but if there were, duplicate rows should be collapsed into a single row. – kartik_subbarao Jun 19 '18 at 21:17
  • (1) *"For example, if row 1 has columns A=10 B=5 C=3, row 2 has A=10 B=4 C=2, and row 3 has A=1 B=4 C=1, rows 2 and 3 are dominated by row 1. I want rows 2 and 3 to be removed."* Huh? By that logic you would end up with exactly one row -- the one with the highest values across all columns. (2) Also: How would you deal with `A=10 B=5 C=3` and `A=5 B=10 C=3`? Which one is higher? – Maurits Evers Jun 20 '18 at 00:53
  • @Maurits You gave your own counter-example, in your example (2) you wouldn't remove either. – Alexis Jun 20 '18 at 07:03
  • @Alexis Yes, my first statement was somewhat rhetorical (clearly my "Huh?" didn't make that clear enough ;-); I'm still waiting for a revised problem statement with a clear set of rules from the OP, which unfortunately is still pending. As it stands right now, the OP's question is *entirely* unclear to me. This might just be me, as you seem to have provided an answer/solution. – Maurits Evers Jun 20 '18 at 08:08
  • I have added more explanation to the question. Basically, I'm trying to weed out rows that I don't need to consider from a large table. Let's say I have a table of computers with columns for clock speed, number of cores, and RAM. I want to eliminate all computers that are obviously less capable than some other computer in the table, so that I'm only left with the computers where I need to make tradeoffs (e.g. one has a faster clock speed but another one has more cores, etc). – kartik_subbarao Jun 20 '18 at 17:45

1 Answer


EDIT2: optimized version below.

I have a feeling you can do better than this solution, but it's probably not trivial. Here I just compare each row against every other row; I only do it in a way that reduces memory utilization, so execution time complexity is still almost quadratic in n (almost, because the for-loop can terminate early)...

library(doParallel)

n <- 50000L
table1 <- replicate(3L, runif(n))

num_cores <- detectCores()
workers <- makeCluster(num_cores)
registerDoParallel(workers)

chunks <- splitIndices(n, num_cores)
system.time({
  is_dominated <- foreach(chunk=chunks, .combine=c, .multicombine=TRUE) %dopar% {
    # each chunk has many rows to be checked
    sapply(chunk, function(i) {
      a <- table1[i,]
      # this will check if any other row dominates row "i"
      for (j in 1L:n) {
        # no row should dominate itself
        if (i == j)
          next

        b <- table1[j,]
        if (all(b >= a))
          return(TRUE)
      }

      # no one dominates "a"
      FALSE
    })
  }
})

non_dominated <- table1[!is_dominated,]

I create chunks of parallel tasks so that each worker processes many rows per call, which reduces communication overhead. I do see a considerable parallelization speed-up on my system.
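
For example, the chunking itself looks like this (a small sketch; the exact split boundaries are up to splitIndices, but each element is one contiguous block of row indices per worker):

# one contiguous block of row indices per worker, covering 1:10 with no overlap
str(splitIndices(10L, 4L))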

EDIT: if you do have duplicated rows, I would remove them beforehand with unique.
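
For example (a sketch; table1 here is the matrix from above, and n has to be recomputed afterwards because the chunks depend on it):

# collapse exact duplicate rows before the dominance check
table1 <- unique(table1)
n <- nrow(table1)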


In this version we shuffle the indices of the rows that each worker has to process. Since the amount of work varies from one i to another, shuffling seems to help with load balance.

With ordering and min_col_val we only need to check rows whose value in the ordering's column is at least as large as row i's value, and we can break out of the loop as soon as that condition is violated. It does seem to be considerably faster in comparison.

ids <- sample(1L:n)
chunks <- lapply(splitIndices(n, num_cores), function(chunk_ids) {
  ids[chunk_ids]
})

system.time({
  orderings <- lapply(1L:ncol(table1), function(j) { order(table1[, j], decreasing=TRUE) })

  non_dominated <- foreach(chunk=chunks, .combine=c, .multicombine=TRUE, .inorder=FALSE) %dopar% {
    chunk_ids <- sapply(chunk, function(i) {
      a <- table1[i,]

      for (col_id in seq_along(orderings)) {
        ordering <- orderings[[col_id]]
        min_col_val <- a[col_id]

        for (j in ordering) {
          if (i == j)
            next

          b <- table1[j,]

          if (b[col_id] < min_col_val)
            break

          if (all(b >= a))
            return(FALSE)
        }
      }

      # no one dominates "a"
      TRUE
    })

    chunk[chunk_ids]
  }

  non_dominated <- table1[sort(non_dominated),]
})
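
As a quick sanity check (just a sketch, assuming is_dominated from the first version is still in the workspace):

# the optimized version should keep exactly the rows the brute-force pass kept
identical(non_dominated, table1[!is_dominated, ])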
Alexis
  • Thanks @Alexis. This is along the lines of what I was thinking about as a next step, thanks for saving me time -- multidplyr isn't well suited for this, and I hadn't looked at doParallel in a while :-) Beyond that, I was thinking about dividing the table into subtables, culling dominated rows in each of the subtables in parallel, recombining the subtables into a top-level table, and repeating this divide-cull-recombine process until the top-level table was small enough (?), and then running a final cull on the top-level table. But that added complexity may or may not buy much :-) – kartik_subbarao Jun 20 '18 at 18:33
    Ok, I think I managed to make an optimized version, but can't guarantee it 100%. – Alexis Jun 20 '18 at 21:24
  • Clever approach! I like how it shortcuts unnecessary comparisons. – kartik_subbarao Jun 21 '18 at 00:08
    BTW, I recently realized that [pareto optimization](https://stackoverflow.com/questions/25585519/pareto-optimization-non-dominated-points) might be a better approach. – Alexis Jun 25 '18 at 21:05
  • Cool, thanks for the pointer. Also -- I noticed that with my data set, the quadratic-parallel approach is actually slower than my original serial approach. Your second approach would likely be much better, but as I was thinking about it some more I came up with another idea which is relatively simple code-wise, but provides a solid performance boost. I'll append it to my question. – kartik_subbarao Jun 26 '18 at 01:43