How can I filter a dataframe based on (randomly selected) unique values of a column?

Question

I read some posts here on how to filter based on specific values in a given column. However, what I am interested in is whether I can filter on randomly selected unique values of a column. To illustrate my question, please consider the following sample dataframe:

MeasurementPoint <- c(1,2,1,2,3,3,4,4,6,7,6,7)
subject <- c(1,1,1,1,2,2,3,3,4,4,4,4)
MeasurementMethod <- c("A","A", "B", "B", "A","B", "A","B","A","A", "B","B")
value <- c(-0.06, 0.11, -0.11, -0.01, -0.13, 0.02, -0.08, 0.09, 0.05, 0.04, -0.03, -0.02)
df1 <- data.frame(MeasurementPoint, subject,MeasurementMethod, value)
df1

 MeasurementPoint subject MeasurementMethod value
         1            1            A        -0.06
         2            1            A         0.11
         1            1            B        -0.11
         2            1            B        -0.01
         3            2            A        -0.13
         3            2            B         0.02
         4            3            A        -0.08
         4            3            B         0.09
         6            4            A         0.05
         7            4            A         0.04
         6            4            B        -0.03
         7            4            B        -0.02

The values are measured on different subjects using two different MeasurementMethod values and at different MeasurementPoints, e.g. multiple spots on the body.

Some subjects have more than one MeasurementPoint, like subjects #1 and #4. The rest have only one MeasurementPoint on their bodies, and only the MeasurementMethod varies for them (subjects #2 and #3).

I would like to keep only one MeasurementPoint per subject and drop the rest. This selection should be done randomly. As an example, the following dataframe would be an outcome of interest:

  MeasurementPoint subject MeasurementMethod value
                2       1                 A  0.11
                2       1                 B -0.01
                3       2                 A -0.13
                3       2                 B  0.02
                4       3                 A -0.08
                4       3                 B  0.09
                6       4                 A  0.05
                6       4                 B -0.03

Please note that the selection of MeasurementPoint = 2 for the first subject and MeasurementPoint = 6 for the last subject should happen randomly.

I think you need "group by column(s) get sample row(s)" see if this post helps: https://stackoverflow.com/q/18258690/680068 — zx8754, Apr 13 '22 at 12:40
Hi @zx8754, thanks for your comment! It's not a duplicate, since in that post the goal is to pick a definite number of observations (500) per ID, but I would like to randomly select a **MeasurementPoint** per **subject** and keep all of its rows. — SteveMcManaman, Apr 13 '22 at 12:54
Using `new_df <- df1 %>% group_by(subject) %>% slice_sample(n=1)` I do not get what I want. However, @benson23 left an answer which generates what I was looking for. Thanks! — SteveMcManaman, Apr 13 '22 at 13:12

score 2 · Accepted Answer · answered Apr 13 '22 at 12:58

We can group_by the subject column, and filter rows that match the random MeasurementPoint value generated by sample.

library(dplyr)

df1 %>% 
  group_by(subject) %>% 
  filter(MeasurementPoint == sample(MeasurementPoint, 1))

# A tibble: 8 × 4
# Groups:   subject [4]
  MeasurementPoint subject MeasurementMethod value
             <dbl>   <dbl> <chr>             <dbl>
1                1       1 A                 -0.06
2                1       1 B                 -0.11
3                3       2 A                 -0.13
4                3       2 B                  0.02
5                4       3 A                 -0.08
6                4       3 B                  0.09
7                6       4 A                  0.05
8                6       4 B                 -0.03
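One caveat that may be worth guarding against: base R's sample(x, 1) draws from 1:x when x is a single number >= 1, so a subject with only one row could end up with no match at all. That does not happen with the sample data above, where every subject has at least two rows. A minimal sketch of a more defensive variant follows. The helper name pick_one is hypothetical (not part of dplyr or of the answer above), and set.seed is only there to make the draw reproducible. It samples an index into the unique values, which also gives every MeasurementPoint the same chance regardless of how many rows it has.

```r
library(dplyr)

# Sample data from the question
df1 <- data.frame(
  MeasurementPoint  = c(1, 2, 1, 2, 3, 3, 4, 4, 6, 7, 6, 7),
  subject           = c(1, 1, 1, 1, 2, 2, 3, 3, 4, 4, 4, 4),
  MeasurementMethod = c("A", "A", "B", "B", "A", "B", "A", "B", "A", "A", "B", "B"),
  value             = c(-0.06, 0.11, -0.11, -0.01, -0.13, 0.02,
                        -0.08, 0.09, 0.05, 0.04, -0.03, -0.02)
)

# Hypothetical helper: pick one distinct value by sampling an index,
# so a lone candidate such as 5 is never expanded to 1:5.
pick_one <- function(x) {
  candidates <- unique(x)
  candidates[sample.int(length(candidates), 1)]
}

set.seed(123)  # optional, only for a reproducible draw

result <- df1 %>%
  group_by(subject) %>%
  # pick_one() is evaluated once per subject and returns a single value
  filter(MeasurementPoint == pick_one(MeasurementPoint)) %>%
  ungroup()

result
```

Whichever points are drawn, the result keeps all rows (both methods) of exactly one MeasurementPoint per subject.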

benson23


If the MeasurementPoint has a duplicated value, this will return both rows. Not a single row. – zx8754 Apr 13 '22 at 14:09
@zx8754 If you take a look at the OP's desired outcome, it's exactly what the OP wants. Also, I disagree it's a duplicate of the post in your comment, and I disagree `slice_sample` can do the job in this case – benson23 Apr 13 '22 at 14:13
I see, you are right, apologies, I misread the question. – zx8754 Apr 13 '22 at 14:18
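
To make the distinction discussed in these comments concrete, here is a rough sketch, not taken from the answer itself. slice_sample(n = 1) on the grouped data keeps a single row per subject, whereas applying it to the distinct (subject, MeasurementPoint) pairs and then joining back keeps every row of the chosen point. The object names one_row, chosen and result2 are made up for illustration.

```r
library(dplyr)

# Sample data from the question
df1 <- data.frame(
  MeasurementPoint  = c(1, 2, 1, 2, 3, 3, 4, 4, 6, 7, 6, 7),
  subject           = c(1, 1, 1, 1, 2, 2, 3, 3, 4, 4, 4, 4),
  MeasurementMethod = c("A", "A", "B", "B", "A", "B", "A", "B", "A", "A", "B", "B"),
  value             = c(-0.06, 0.11, -0.11, -0.01, -0.13, 0.02,
                        -0.08, 0.09, 0.05, 0.04, -0.03, -0.02)
)

set.seed(123)  # optional, only for a reproducible draw

# Sampling rows directly: one row per subject, so one method is lost
one_row <- df1 %>%
  group_by(subject) %>%
  slice_sample(n = 1) %>%
  ungroup()

# Sampling among the distinct points instead: one point per subject
chosen <- df1 %>%
  distinct(subject, MeasurementPoint) %>%
  group_by(subject) %>%
  slice_sample(n = 1) %>%
  ungroup()

# Keep every row of df1 that belongs to a chosen (subject, point) pair
result2 <- semi_join(df1, chosen, by = c("subject", "MeasurementPoint"))

nrow(one_row)   # one row per subject
nrow(result2)   # all rows of the chosen point per subject
```

With the sample data, one_row has 4 rows and result2 has 8, since every MeasurementPoint was measured with both methods.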
