I want to select all rows in a larger dataset whose identification number exists in another dataset, in R

Asked Jul 06 '22 at 19:15

Active Jul 06 '22 at 19:25

Viewed 33 times

I have a large dataset, let's call it df1 (4226 observations X 186 variables).

I used a package called naniar to assess missingness, and created a dataset that shows, for each observation, the percentage of missing data. I then filtered that dataset to show only the observations (rows) with less than 50% missing data. Then I created a dataset containing just the row numbers of all rows that fit the missingness criterion; we can call this df2.
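A minimal base R sketch of the steps described above, for readers who want to reproduce them. It uses `rowMeans(is.na(...))` in place of naniar, so it is not the asker's actual code. The toy values mirror the four-row example given in the comments, under the assumption that `v1` there is the row identifier and `v2` to `v4` are the measured variables; the column name `row` is taken from the attempted code in the question.

```r
# Toy stand-in for df1: three measured variables, NA marks a missing value.
df1 <- data.frame(
  v2 = c(NA, 1, NA, 1),
  v3 = c(NA, 1, NA, NA),
  v4 = c(NA, NA, 1, 1)
)

# Proportion of missing values in each row (between 0 and 1).
pct_missing <- rowMeans(is.na(df1))

# Keep the row numbers of rows with less than 50% missing data.
df2 <- data.frame(row = which(pct_missing < 0.5))

df2$row  # rows 2 and 4
```

On the real data, `df2` built this way would hold one row number per retained observation.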

Now, I want to create a subset of dataset df1 using the data in df2 (2044 observations X 1 variable).

Can anyone help me here?

I have tried something like:

df3 <- df2[df2$row %in% df1]

edited Jul 06 '22 at 19:25

Rui Barradas


asked Jul 06 '22 at 19:15

ayzee


It's easier to help you if you provide a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. We don't need your actual data. You can just include a small example to make things clearer. – MrFlick Jul 06 '22 at 19:17
table = v1, v2, v3, v4; 1 . . .; 2 1 1 .; 3 . . 1; 4 1 . 1. Since I only want observations with less than 50% missingness, I would only want rows 2 and 4. – ayzee Jul 06 '22 at 19:20

Looks like you want to do a filtering join. Read the documentation for `semi_join()` from the dplyr package. – shafee Jul 06 '22 at 19:22
Perfect, that is exactly it. – ayzee Jul 06 '22 at 19:23
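
A hedged sketch of the `semi_join()` suggestion. A filtering join needs a shared key column, so this assumes df1 is given an explicit `id` column holding its row number and that df2 stores the retained row numbers under the same name; both the `id` name and the toy values are illustrative, not taken from the asker's data.

```r
library(dplyr)

# Toy stand-in for df1; the real data has 4226 rows and 186 variables.
df1 <- data.frame(
  v2 = c(NA, 1, NA, 1),
  v3 = c(NA, 1, NA, NA),
  v4 = c(NA, NA, 1, 1)
)

# Add the row number as an explicit key column (assumed name: id).
df1 <- mutate(df1, id = row_number())

# Toy stand-in for df2: the row numbers that passed the missingness filter.
df2 <- data.frame(id = c(2L, 4L))

# Keep only the rows of df1 whose id appears in df2; no columns from df2 are added.
df3 <- semi_join(df1, df2, by = "id")

df3$id  # 2 and 4
```

Unlike `inner_join()`, `semi_join()` never duplicates rows of df1 when a key appears more than once in df2.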

Maybe swap the df's and match the column, not df2: `df3 <- df1[df1$row %in% df2$row, ]`. And those are row indices, you forgot the comma. – Rui Barradas Jul 06 '22 at 19:25
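
A minimal base R sketch of subsetting df1 by the row numbers stored in df2. It assumes df2 has a single column named `row`, as in the attempted code in the question, and uses toy values. Note that `[` with a single index and no comma selects columns of a data frame rather than rows, which is why the comma matters.

```r
# Toy stand-ins for the two datasets.
df1 <- data.frame(
  v2 = c(NA, 1, NA, 1),
  v3 = c(NA, 1, NA, NA),
  v4 = c(NA, NA, 1, 1)
)
df2 <- data.frame(row = c(2L, 4L))

# Index df1 directly by the stored row numbers; the trailing comma keeps all columns.
df3 <- df1[df2$row, ]

# Same rows via %in%, for the case where df1 carries the row number as a column.
df1$row <- seq_len(nrow(df1))
df3_alt <- df1[df1$row %in% df2$row, ]

rownames(df3)  # "2" "4"
```

Direct indexing returns rows in the order given in `df2$row`, while the `%in%` form keeps the original order of df1.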
