I am stuck on a simple problem. I have a scatter plot, and I have plotted confidence lines around it using a custom formula. Now I want only the names of the points outside the cutoff lines to be displayed, and nothing inside. But I can't figure out how to subset my data based on the line coordinates.

The line is plotted using the lines function, from vectors of 128 x and y values. How do I subset my data (x, y points) based on these two vectors? I can apply a static limit, subsetting the data by a single number such as 1, 2 or 3, but how to use a vector to subset the data has me stuck.

enter image description here

For a reproducible example, consider:

df=data.frame(x=seq(2,16,by=2),y=seq(2,16,by=2),lab=paste("label",seq(2,16,by=2),sep=''))
plot(df[,1],df[,2])

# adding lines
lines(seq(1,15),seq(15,1),lwd=1, lty=2)

# adding labels
text(df[,1],df[,2],labels=df[,3],pos=3,col="red",cex=0.75)

Now, I need just the labels of the points that are outside or intersecting the line.

I was trying to subset my data frame with the values used for the lines, but I can't get it right.

Static subsetting can be done for single values, like df[which(df[,1]>8 & df[,2]>8),], but how do I do it for the whole vector of line values?

I also tried sapply, to cycle over all the x and y values used for the lines and test them against df iteratively, but most points come out TRUE for one limit and FALSE for the others.

enter image description here

Thanks

Sukhi
  • What have you tried so far? A [minimal reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610#5963610) would be nice as well. This will make it much easier for others to help you. – Jaap Nov 25 '14 at 14:38
  • Right, I will add it quickly. Thanks – Sukhi Nov 25 '14 at 14:41

1 Answer

I will address your original volcano-type graph problem rather than the made-up example, because the two are quite different.

I thought about this a lot and I believe I reached a solid conclusion. There are two options: 1. You know the equations of the lines, which would be really easy to work with. 2. You do not know the equations of the lines, which means we need to work with an approximation.

Some geometry:

The equation y = a + bx describes a line. For a given pair of coordinates (x, y), if y is greater than the right-hand side of the equation evaluated at x, the point is above the line; otherwise it is on or below the line. The same idea holds if you have a curve (as in your case).

If you have the equations, it is easy to apply the test above in my code below and you are set. If not, you need to approximate the curve. To do that you will need the following code:
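As a minimal sketch of this test on the toy data from the question: the dashed line there runs from (1, 15) to (15, 1), so it can be written as y = 16 - x, i.e. a = 16 and b = -1. The names `a`, `b` and `outside` are only illustrative, and "outside" is assumed here to mean "on or above the line".

```r
# Toy data from the question
df <- data.frame(x = seq(2, 16, by = 2),
                 y = seq(2, 16, by = 2),
                 lab = paste0("label", seq(2, 16, by = 2)),
                 stringsAsFactors = FALSE)

# Assumed line y = a + b * x through (1, 15) and (15, 1)
a <- 16
b <- -1

# TRUE where a point lies on or above the line (vectorised over all rows)
outside <- df$y >= a + b * df$x

# Labels of the points that pass the test
df$lab[outside]
```

The logical vector `outside` can then be used to index the arguments of text(), for example text(df$x[outside], df$y[outside], labels = df$lab[outside]), so that only those points are labelled.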

df=data.frame(x=seq(2,16,by=2),y=seq(2,16,by=2),lab=paste("label",seq(2,16,by=2),sep=''))

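make_vector <- function(df) {
  # Assumes line1x, line1y, line2x, line2y (line coordinates) and
  # a, b (intercept and slope of the approximating line) already exist
  lab <- rep(NA_character_, nrow(df))  # NA means "no label"
  for (i in seq_len(nrow(df))) {
    this_row <- df[i,]  #this will contain the three elements per row
    if ( (this_row[1] < max(line1x) & this_row[2] > max(line1y) & this_row[2] < a + b*this_row[1])
          |
          (this_row[1] > min(line2x) & this_row[2] > max(line2y) & this_row[2] > a + b*this_row[1]) ) {
      lab[i] <- as.character(this_row[[3]])  # keep the label as plain text
    }
  }
  return(lab)
}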
#this_row[1] = your x
#this_row[2] = your y
#this_row[3] = your label



df$labels <- make_vector(df)


plot(df[,1],df[,2])

# adding lines
lines(seq(1,15),seq(15,1),lwd=1, lty=2)

# adding labels
text(df[,1],df[,2],labels=df[,4],pos=3,col="red",cex=0.75)

The important bit is the function. Imagine that you have df as you created it, with x, y and lab. You will also have vectors with the x, y coordinates for line1 and the x, y coordinates for line2.

Let's look at the condition for line1 only (the same logic applies to line2, which is implemented in the code above):

this_row[1] < max(line1x) & this_row[2] > max(line1y) & this_row[2] < a + b*this_row[1]
#translates to:
#this_row[1] < max(line1x) = your x needs to be less than the max x (vertical line in graph below)
#this_row[2] > max(line1y) = your y needs to be greater than the max y (horizontal line in graph below)
#this_row[2] < a + b*this_row[1] = your y needs to be less than the right hand side of the equation (to have a point above i.e. left of the line)
#check below what the line is

This will produce something like the graph below (it is rough and also magnified, but it is just a reference; visualize it approximating your lines):

enter image description here

The above code would pick all the points in the area above the triangle and within the y=1 and x=1 lines.
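If the curve is only available as vectors of coordinates, one alternative to a single straight-line approximation (not part of the approach above, just a sketch) is to interpolate the curve's height at each data point's x with approx() from the stats package and compare. The curve and points below are made up for illustration, and the sketch assumes that the curve's x values are increasing and that "outside" means "above the curve".

```r
# Hypothetical curve given as vectors of coordinates (not the real data)
curve_x <- seq(1, 10, length.out = 128)
curve_y <- 5 / curve_x

# Hypothetical points to classify
pts <- data.frame(x = c(2, 2, 5, 5, 12),
                  y = c(1, 4, 0.5, 2, 3),
                  lab = c("p1", "p2", "p3", "p4", "p5"),
                  stringsAsFactors = FALSE)

# Curve height at each point's x; NA where x falls outside the curve's x range
curve_at_x <- approx(curve_x, curve_y, xout = pts$x)$y

# TRUE where the point lies above the curve; points outside the x range get FALSE
above <- !is.na(curve_at_x) & pts$y > curve_at_x

# Labels of the points above the curve
pts$lab[above]
```

Because the comparison is vectorised, the curve does not need as many elements as the data frame has rows.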

Finally the equation:

Given the coordinates of two points, you can work out a line's equation by solving a system of two equations in the two parameters a and b (substitute each point's x and y into y = a + bx).

The two points to pick are the two closest to the tangent of the first line (line1). Choose them by hand according to your data. The closer to the tangent the better. Just plot the points and eyeball it.
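A small sketch of that calculation, assuming two points with different x values; the helper name `line_through` is made up for illustration, and the two example points are the end points of the dashed line in the question's toy example.

```r
# Intercept (a) and slope (b) of the line y = a + b * x through two points.
# p1 and p2 are numeric vectors c(x, y); their x values must differ.
line_through <- function(p1, p2) {
  b <- (p2[2] - p1[2]) / (p2[1] - p1[1])  # slope from the two points
  a <- p1[2] - b * p1[1]                  # intercept from y = a + b * x at p1
  c(a = a, b = b)
}

# End points of the dashed line in the question's toy example
coefs <- line_through(c(1, 15), c(15, 1))
```

Here `coefs` holds the intercept and slope that the condition y > a + b*x needs.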

Having done all the above you have your points with your labels (approximately at least).

And that is the only thing you can do!

Long talk but hope it helps.

P.S. I haven't tested the code because I have no data.

LyzandeR
  • No, the problem is the subsetting not the plotting. You did not get it right, you subsetted using the value 8, but instead there is a vector containing 15 elements. So, something like this `df$labels[which(df[,1]>seq(1,15) & df[,2]>seq(1,15))]` :) – Sukhi Nov 25 '14 at 15:36
  • I don't get the testing though. How would the test of the condition occur? each element of df[,1] against each element of the seq(1,15)? 1vs1? I think this is what you probably mean. this makes sense. – LyzandeR Nov 25 '14 at 15:50
  • No sorry I still dont get it.... You need to explain what you would want the test to be. – LyzandeR Nov 25 '14 at 16:00
  • Yes, that's right, that is my question: how to approach this problem and how to test it. It cannot be a test of each element of df[,1] against seq(1,15), as all elements of df[,1] might be true for the first element of seq(1,15), and fewer are true as the value of the element increases. If you think about this problem with respect to the first plot, it seems complicated. – Sukhi Nov 25 '14 at 16:02
  • Ok, I want to label points outside the curve not inside. Now the curve is a vector of points, and I want to subset my data frame with respect to curve points, any outside I want. So, for this we need limits, so either I convert the vector of points used to construct curves, into a single value limit and subset my dataset which is easy but not dynamic. How would you approach this? – Sukhi Nov 25 '14 at 16:05
  • If its a straight line, its easy to subset but with the curve, I am getting lost. – Sukhi Nov 25 '14 at 16:06
  • In your dataframe do you have the coordinates for both of your lines? i.e. an x value and a y value for each line and both lines having the same number of elements as the data you have? – LyzandeR Nov 25 '14 at 16:12
  • something like a dataframe with the following columns: x,y,lab,line1x,line1y,line2x,line2y ??? This question is for your original plot. The volcano looking like. – LyzandeR Nov 25 '14 at 16:15
  • No, the data frame can have more than 1000 elements (rows), but the lines, constructed from separate x and y coordinates, can have fewer than 200 elements. – Sukhi Nov 25 '14 at 16:18
  • I feel that the only way for this to work is to produce as many datapoints as the rows of your dataframe. otherwise it is really difficult to do the checks... – LyzandeR Nov 25 '14 at 16:20
  • If you can't do that I would suggest replicating each row of the line values as many times as needed to match the rows of your dataframe. if you can do that I can provide a code to do the checking (just the code. I dont need any data). If you cannot do that then I cannot help I am afraid... – LyzandeR Nov 25 '14 at 16:26
  • I can produce as many datapoints as my dataframe, this is not a problem – Sukhi Nov 25 '14 at 16:30
  • Cool. I ll provide you with the code and a small example to replicate your problem. I need to go now but I ll do it in a few hours. – LyzandeR Nov 25 '14 at 16:32
  • Btw do you have the equations for the two lines? If you do it will be easy I think. – LyzandeR Nov 25 '14 at 19:24
  • The equations of the two lines of the volcano like graph – LyzandeR Nov 25 '14 at 19:54
  • I submitted a new answer. Hope it helps. – LyzandeR Nov 25 '14 at 22:07