
I have data that I am analysing for a lab and I'm trying to use R for the first time.

I've been reading about selecting rows based on conditions but I can't seem to find the way to do it for my data.

I made a data frame and I didn't name the columns. Each column is a particular variant of a bacterial species that I am testing and its increasing values of OD/absorbance (in total 56 rows for each column) over a period of about 15 hours.

I want to select the rows with values ranging from 0.2 to 0.4 from EACH column.

A section of my data frame

So ideally I want something like:

   V1       V2
9  0.2100  7 0.2181
10 0.3017  8 0.3162
11 0.4079  9 0.4137

etc.

I guess I can select the rows manually from each column but there must be a quicker way.

I then plan to calculate the mean of each column of the subset.

Any help would be much appreciated, thanks!

Heikki
dumeir
  • Welcome to StackOverflow! Please read the info about [how to ask a good question](http://stackoverflow.com/help/how-to-ask) and how to give a [reproducible example](http://stackoverflow.com/questions/5963269). This will make it much easier for others to help you. – Sotos Jan 24 '18 at 13:16
  • You can try `lapply(df1, function(x) x[x > 0.2 & x < 0.4])` and keep it in a `list` as the number of elements in each column that agree with the condition may differ – akrun Jan 24 '18 at 13:23
  • If you only want those means, use `sapply(df, function(x) mean(x[x > .2 & x < .4]))`. This way you'll get a vector of length `ncol(df)` with the mean of the subset for each column. – LAP Jan 24 '18 at 13:45
  • @LAP Amazing, thank you so much! But how do I combine both of your answers? To be more specific, I only want the value closest to 0.3 and then the 2 values around it so around 0.2 and 0.4, but I only want 3 values. As akrun suggested, the number of elements is different in some columns. How do I ensure that the sapply mean is for 3 values? – dumeir Jan 24 '18 at 14:10

2 Answers


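In this example you get a list with one vector per variable, each holding the values between 0.2 and 0.4. Hope it helps.

df <- data.frame(V1 = c(1, 0.3, 2, .1, .5, 8, .1, .4, .35, .22, 6), V2 = c(0.2, 0.3, 3, .15, .32, 5, .1, .45, .35, .3, 6))
# lapply always returns a list, even if every column keeps the same number of values
filteredColumns <- lapply(df, function(x) x[x > 0.2 & x < 0.4])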
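
To go from the filtered values to the per-column means the question asks for, a minimal sketch follows. It reuses the example data from this answer; `in_range` and `col_means` are illustrative names, not part of the original post.

```r
df <- data.frame(V1 = c(1, 0.3, 2, .1, .5, 8, .1, .4, .35, .22, 6),
                 V2 = c(0.2, 0.3, 3, .15, .32, 5, .1, .45, .35, .3, 6))

# Keep, per column, the values strictly between 0.2 and 0.4.
# The result is a list because columns may keep different numbers of values.
in_range <- lapply(df, function(x) x[x > 0.2 & x < 0.4])

# Mean of the kept values in each column (NaN if a column keeps nothing)
col_means <- sapply(in_range, mean)
col_means
```

The comparison is strict, so values of exactly 0.2 or 0.4 are excluded; use `>=` and `<=` if the limits should be included.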
Antonios

This will do it:

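findNearest3 <- function(x, y, z){
  # keep only values strictly inside the limits z, in increasing order
  temp <- sort(x[x > z[1] & x < z[2]])
  # position of the value closest to y (the first one if there are ties)
  point <- which.min(abs(temp - y))
  # the closest value plus its neighbours in the sorted vector
  return(temp[c(point-1, point, point+1)])
}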

The function will look for the nearest value to y within vector x, constrained by limits z, and return this value plus the one before and after within the sorted vector.

Example:

set.seed(123)
df <- data.frame(x = rnorm(100), y = rnorm(100))

sapply(df, findNearest3, .3, c(.2, .4))
             x         y
[1,] 0.2533185 0.2982276
[2,] 0.3035286 0.3011534
[3,] 0.3317820 0.3104807

Now with

sapply(df, function(x) mean(findNearest3(x, .3, c(.2, .4))))

you'll get the means:

        x         y 
0.2962097 0.3032872 

Be aware that the result will contain NA if there are not enough values within the given constraints z:

df <- data.frame(x = c(.1, .23, .35, .5), y = c(.22, .24, .33, .48))

> sapply(df, findNearest3, .3, c(.2, .4))
        x    y
[1,] 0.23 0.24
[2,] 0.35 0.33
[3,]   NA   NA

> sapply(df, function(x) mean(findNearest3(x, .3, c(.2, .4)), na.rm = TRUE))
    x     y 
0.290 0.285 
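
If the NA padding is unwanted, one option is to drop neighbour positions that fall outside the filtered vector, so the function returns fewer than three values instead. The following is only a sketch of that idea; `findNearest3_safe` is a hypothetical name, not part of the original answer.

```r
findNearest3_safe <- function(x, y, z) {
  # values strictly inside the limits z, in increasing order
  temp <- sort(x[x > z[1] & x < z[2]])
  if (length(temp) == 0) return(numeric(0))
  # position of the closest value (the first one if there are ties)
  point <- which.min(abs(temp - y))
  idx <- (point - 1):(point + 1)
  # keep only positions that actually exist in temp
  temp[idx[idx >= 1 & idx <= length(temp)]]
}

df <- data.frame(x = c(.1, .23, .35, .5), y = c(.22, .24, .33, .48))

# lapply, because columns may now return different numbers of values
res <- lapply(df, findNearest3_safe, .3, c(.2, .4))
sapply(res, mean)
```

With this variant `na.rm = TRUE` is no longer needed, but the mean of a column may be based on one or two values rather than three, so it is worth checking `lengths(res)`.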

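Edit: To return the positions of the values instead of the values themselves, return the indices rather than the subset of `temp` (this version also drops the limits z). Note that the positions refer to the sorted vector, so they equal the original row numbers only when each column is already in increasing order, as with OD readings that rise over time: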

findNearest3.pos <- function(x, y){
  temp <- sort(x)
  point <- which(abs(temp-y)==min(abs(temp-y)))
  return(c(point-1, point, point+1))
}
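
For columns that are not already in increasing order, positions in the sorted vector differ from row numbers. A sketch of a variant that maps back to the original rows with `order()` is shown below; `findNearest3_rows` is a hypothetical name and the small vector is made up for illustration.

```r
findNearest3_rows <- function(x, y) {
  ord <- order(x)                       # row numbers, in increasing order of x
  point <- which.min(abs(x[ord] - y))   # position of the closest value in sorted order
  idx <- (point - 1):(point + 1)
  idx <- idx[idx >= 1 & idx <= length(ord)]  # drop positions outside the vector
  ord[idx]                              # translate back to original row numbers
}

x <- c(0.50, 0.10, 0.31, 0.22, 0.45)
rows <- findNearest3_rows(x, 0.3)
rows      # original rows holding the three values
x[rows]   # the values themselves, in increasing order
```

For a column that is already sorted, this returns the same positions as sorting first and indexing afterwards.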

Application: To use it on another dataframe of the same dimensions, first save the positions in a list:

myrows <- lapply(df, findNearest3.pos, y = .3)

and then subset the second dataframe:

set.seed(234)
df1 <- data.frame(x = rnorm(100), y = rnorm(100))

newsubset <- mapply(function(x, y) x[y], df1, myrows)
              x        y
[1,] -0.9581388 2.214151
[2,]  0.6280635 0.455070
[3,]  0.6625872 0.513053

For the other data frame, which has only one column, you need to decide which column's row positions to use.

set.seed(345)
df2 <- data.frame(x = rnorm(100))

You could access the row positions found in V1 (or, in this example x) with:

df2[myrows[[1]],]
[1]  0.2986353 -0.9917691 -0.6510206

and those found in V2 (here named y) with:

df2[myrows[[2]],]
[1] -0.3148442 -0.2491949  0.6854260
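
To compute the mean of the single-column data frame for every set of row positions at once, `sapply` over the list of positions is one way to do it. The sketch below is self-contained and uses small made-up values for `myrows` and `df2`, so its numbers are not the ones from the examples above.

```r
# Made-up row positions, one element per column of the first data frame
myrows <- list(V1 = c(2, 3, 4), V2 = c(3, 4, 5))

# Made-up one-column data frame with the same number of rows
df2 <- data.frame(x = c(0, 0.25, 0.5, 0.75, 1, 1.25))

# df2[r, ] drops to a plain vector because df2 has a single column,
# so mean() can be applied directly; the result is a named vector
df2_means <- sapply(myrows, function(r) mean(df2[r, ]))
df2_means
```

The result has one entry per element of `myrows`, named after the columns the positions came from.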
LAP
  • Thank you! But I think for my purpose the earlier version of the code you posted worked better, because I don't need the values around 0.3 to be exactly within 0.2 and 0.4, as some values after ~0.3 can be up to around 0.46. With the additional c(.2, .4) I was getting too many NAs. – dumeir Jan 24 '18 at 15:01
  • If you don't mind, would you be able to help me with another problem? I want to know which rows these sets of 3 values were in each column, e.g. in column 1 they were rows 7, 8, 9 but in column 2 they were rows 8, 9, 10 and so on. I then want to select the corresponding rows in a different data frame with the same dimensions, so the values ~0.2, ~0.3, ~0.4 aren't relevant in this case, and then once again calculate the mean. Thanks! – dumeir Jan 24 '18 at 15:02
  • You can just leave the `z` and the `[x > z[1] & x < z[2]]` out of the function to use it without the limits. I'll get to your other request and will edit a solution into my answer. – LAP Jan 25 '18 at 07:59
  • Thanks so much for your solution! So how do I take the row positions to find the corresponding rows in another data frame? I've got 2 other data frames - one has the same dimensions as the one I've been using, then the other one is just 1 column but has the same number of rows. – dumeir Jan 25 '18 at 11:22
  • THANK YOU!!! Everything has worked so far, but with the one column data frame, I want to use every column from the first data frame, of which I've got about 66 I think. Right now I've got this: `set.seed(345)` `mean(time_hr[myrows[[1]],])` `mean(time_hr[myrows[[2]],])` `mean(time_hr[myrows[[3]],])` `mean(time_hr[myrows[[4]],])` `mean(time_hr[myrows[[5]],])` `mean(time_hr[myrows[[6]],])` and so on. Is there a way to combine these into 1 list/data frame? – dumeir Jan 25 '18 at 13:24
  • Now that's what `sapply` is for: `time_hr_means <- sapply(myrows, function(x) mean(time_hr[x, ]))` will give you a vector of all those means. By the way, if this answers your question, feel free to check the "accept" button :) – LAP Jan 25 '18 at 13:26