11

I know there must be an easy answer to this but somehow I can't seem to find it...

I have a data frame with 2 numeric columns. I would like to remove from it, the rows, which have the property, that there exists at least one other row in the data frame, with both column values bigger than the ones in this row.

So if I have

    Col1 Col2  
1     2    3  
2     4    7  
3     5    6  

I would like to remove the first row, because the second one fulfills the property and keep only rows 2 and 3.

Thanks a lot!

Jens A. Koch
  • 39,862
  • 13
  • 113
  • 141
user1184143
  • 111
  • 4
  • I can't set up the edit because it's only spaces, but your table could benifit from using the code format: put an extra 4 spaces before each line, and it will come out with the same formatting you used, and make it more readable. – Jeff Feb 02 '12 at 02:41
  • Thanks Jeff, I tried to figure out how to do that but I failed miserably. – user1184143 Feb 02 '12 at 04:01

6 Answers6

21

That problem is called a "skyline query" by database administrators (they may have other algorithms) and an "efficient frontier" by economists. Plotting the data can make it clear what we are looking for.

n <- 40
d <- data.frame(
  x = rnorm(n),
  y = rnorm(n)
)
# We want the "extreme" points in the following plot
par(mar=c(1,1,1,1))
plot(d, axes=FALSE, xlab="", ylab="")
for(i in 1:n) {
  polygon( c(-10,d$x[i],d$x[i],-10), c(-10,-10,d$y[i],d$y[i]), 
  col=rgb(.9,.9,.9,.2))
}

The algorithm is as follows: sort the points along the first coordinate, keep each observation unless it is worse than the last retained one.

d <- d[ order(d$x, decreasing=TRUE), ]
result <- d[1,]
for(i in seq_len(nrow(d))[-1] ) {
  if( d$y[i] > result$y[nrow(result)] ) {
    result <- rbind(result, d[i,])  # inefficient
  } 
}
points(result, cex=3, pch=15)

Skyline

Vincent Zoonekynd
  • 31,893
  • 5
  • 69
  • 78
  • Yes, that's exactly what I wanted to do but I thought doing it the "C" way was not appropriate, thanks a bunch! – user1184143 Feb 02 '12 at 04:03
  • +1 and thanks for the great answer. I especially appreciate your making the connection to the skyline query and then including a plot to illustrate why it's got that name! I've added an answer below, inspired by yours that replaces that `for()` loop with a more R-ish vectorized construct. – Josh O'Brien Feb 02 '12 at 17:07
  • I didn't really appreciate the ingenuity of the algorithm the first time I read it, thanks again! – user1184143 Feb 04 '12 at 01:35
  • What a fantastic explanation! – ECII Jul 03 '12 at 07:28
  • Thanks for that! I think this is also called "pareto optimality": a state of allocation of resources in which it is impossible to make any one individual better off without making at least one individual worse off [Wikipedia "pareto efficiency a state of allocation of resources in which it is impossible to make any one individual better off without making at least one individual worse off"], and the highlighted data represents the "Pareto front". – Chris Jul 23 '14 at 17:50
  • It's interesting how close this is to `chull`, the convex hull computation. However the second big point shows an example of how the `chull` would be insufficient for this criteria. – geneorama Oct 11 '19 at 16:36
9

Edit (2015-03-02): For a more efficient solution, please see Patrick Roocks' rPref, a package for "Database Preferences and Skyline Computation", (also linked to in his answer below). To show that it finds the same solution as my code here, I've appended an example using it to my original answer here.


Riffing off of Vincent Zoonekynd's enlightening response, here's an algorithm that's fully vectorized, and likely more efficient:

set.seed(100)
d <- data.frame(x = rnorm(100), y = rnorm(100))

D   <- d[order(d$x, d$y, decreasing=TRUE), ]
res <- D[which(!duplicated(cummax(D$y))), ]
#             x         y
# 64  2.5819589 0.7946803
# 20  2.3102968 1.6151907
# 95 -0.5302965 1.8952759
# 80 -2.0744048 2.1686003


# And then, if you would prefer the rows to be in 
# their original order, just do:
d[sort(as.numeric(rownames(res))), ]
#            x         y
# 20  2.3102968 1.6151907
# 64  2.5819589 0.7946803
# 80 -2.0744048 2.1686003
# 95 -0.5302965 1.8952759

Or, using the rPref package:

library(rPref)
psel(d, high(x) | high(y))
#             x         y
# 20  2.3102968 1.6151907
# 64  2.5819589 0.7946803
# 80 -2.0744048 2.1686003
# 95 -0.5302965 1.8952759
Patrick Roocks
  • 3,129
  • 3
  • 14
  • 28
Josh O'Brien
  • 159,210
  • 26
  • 366
  • 455
5

Here is an sqldf solution where DF is the data frame of data:

library(sqldf)
sqldf("select * from DF a
 where not exists (
   select * from DF b
   where b.Col1 >= a.Col1 and b.Col2 >  a.Col2  
      or b.Col1 >  a.Col1 and b.Col2 >= a.Col2
 )"
)
G. Grothendieck
  • 254,981
  • 17
  • 203
  • 341
4

This question is pretty old, but meanwhile there is a new solution. I hope it is ok to do some self-promotion here: I developed a package rPref which does an efficient Skyline computation due to C++ algorithms. With installed rPref package the query from the question can be done via (assuming that df is the name of data set):

library(rPref)
psel(df, high(Col1) | high(Col2))

This removes only those tuples, where some other tuple is better in both dimensions.

If one requires the other tuple to be strictly better in just one dimension (and better or equal in the other dimension), use high(Col1) * high(Col2) instead.

Patrick Roocks
  • 3,129
  • 3
  • 14
  • 28
  • Hah! Just clicked through to your profile, saw the link to `rPref`, and came straight over here to make a note that there's now a better solution. Will add a note to my answer pointing folks down here, since otherwise your answer is pretty much lost in the weeds! – Josh O'Brien Mar 02 '15 at 17:23
  • Does this package deal with efficient frontiers that have more than two variables? Let's say we wanted to include n-variables – road_to_quantdom Mar 06 '15 at 00:24
  • Yes, it does. There is no principal limitation in the number of variables/dimensions, `high(x1) | high(x2) | ... | high(xn)` is also possible, but the computation costs (and hence the run time) will grow with the number of dimensions. – Patrick Roocks Mar 11 '15 at 08:15
2

In one line:

d <- matrix(c(2, 3, 4, 7, 5, 6), nrow=3, byrow=TRUE)
d[!apply(d,1,max)<max(apply(d,1,min)),]

     [,1] [,2]
[1,]    4    7
[2,]    5    6

Edit: In light of your precision in jbaums' response, here's how to check for both columns separately.

d <- matrix(c(2, 3, 3, 7, 5, 6, 4, 8), nrow=4, byrow=TRUE)
d[apply(d,1,min)>min(apply(d,1,max)) ,]

     [,1] [,2]
[1,]    5    6
[2,]    4    8
Pierre Lapointe
  • 16,017
  • 2
  • 43
  • 56
  • Thanks however, I need to treat each column independently, for example if I have d <- matrix(c(1, 8, 7, 2), nrow=2, byrow=TRUE), both rows should be kept because the first row has the highest value in the second column and the second row has the highest value in the first column so there is no other row that has a higher value in **both** columns. – user1184143 Feb 04 '12 at 01:33
0
d <- matrix(c(2, 3, 4, 7, 5, 6), nrow=3, byrow=TRUE)
d2 <- sapply(d[, 1], function(x) x < d[, 1]) & 
      sapply(d[, 2], function(x) x < d[, 2])
d2 <- apply(d2, 2, any)
result <- d[!d2, ]
jbaums
  • 27,115
  • 5
  • 79
  • 119
  • Hmm, I don't fully understand this but my feeling is that it doesn't look at each column separately. For instance if you do > d <- matrix(c(2, 3, 3, 7, 5, 6, 4, 8), nrow=4, byrow=TRUE) row (3, 7) should also be removed because (4, 8) is bigger in both columns. Thanks for replying though. – user1184143 Feb 02 '12 at 04:14
  • Right, I misinterpreted your requirements. I read it to mean "if each of the numbers in row B are larger than each of the numbers in row A, exclude row A". Because 4 and 8 are not each larger than 3 and 7 (i.e. 4 is not larger than 7), the criterion was not met, and the row not excluded. Now I understand you are interested in whether the values in row B are larger than the values in the corresponding columns of row A. I've edited the answer to correct it (I think). – jbaums Feb 02 '12 at 04:49
  • (This should now exclude those rows where another row exists that has larger values in each column. I believe Vincent's solution will excludes rows where another row exists that has larger *or equal* values in each column, although that may indeed be what you intended to write in your question..?) – jbaums Feb 02 '12 at 06:26