Subsetting a dataframe based on a vector of strings

Question

I have a large dataset called Genetics that I need to break down. It has 4 columns: the first is the patient ID, which is sometimes duplicated, and the other 3 describe the patients.

As mentioned, some of the patient IDs are duplicated and I want to find out which ones, without losing the remaining columns.

dedupedGenID <- unique(Genetics$ID)

will only give me the unique IDs, without the other columns.

In order to subset the df by those unique IDs I did

dedupedGenFull <- Genetics[str_detect(Genetics$patientID, pattern = dedupedGenID), ]

This gives me an error of "longer object length is not a multiple of shorter object length", and dedupedGenFull has only 55 rows, while dedupedGenID is a character vector of length 1837.

My questions are: how do I perform that subsetting step correctly? And how do I do the same for the IDs that are duplicated, i.e. how do I subset the df so that I get the IDs and the other columns of the patients that repeat?

Any thoughts would be appreciated.

score 0 · Answer 1 · answered Jan 16 '20 at 13:57


library(data.table)
Genetics <- data.table(Genetics)  # convert the data frame to a data.table
# add a logical column flagging repeated IDs
Genetics[, is_duplicated := duplicated(ID)]

This chunk converts your data to a data.table and adds a new column that is TRUE if the ID is a duplicate and FALSE otherwise. Note that duplicated() flags only the second and later occurrences, so the first occurrence of each ID is marked FALSE.
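As a minimal sketch of that behaviour, here is duplicated() on a small made-up vector (the IDs are hypothetical, not taken from the original data). Combining it with `fromLast = TRUE` is one way to flag the first occurrence as well:

```r
# Toy IDs (hypothetical values, not from the original dataset)
ids <- c("p1", "p2", "p1", "p3", "p2", "p1")

# TRUE only for the second and later occurrences of each ID
later_copies <- duplicated(ids)
# FALSE FALSE TRUE FALSE TRUE TRUE

# TRUE for every occurrence of an ID that appears more than once
all_copies <- duplicated(ids) | duplicated(ids, fromLast = TRUE)
# TRUE TRUE TRUE FALSE TRUE TRUE
```

A logical vector like `all_copies` can be used directly as a row index, which keeps all the other columns of the matching rows.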

answered Jan 16 '20 at 13:57

SideDeveloper


Thank you for your interest. Unfortunately, this is not an option for me, as I need to pull the IDs **with** the remaining three columns. – Wojty Jan 16 '20 at 14:45
You can use `duplicated2 <- function(x) duplicated(x) | duplicated(x, fromLast=TRUE)` to catch all duplicates. – alan ocallaghan Mar 03 '20 at 14:08

score 0 · Accepted Answer · answered Jan 16 '20 at 13:57


We can use duplicated to get the IDs that occur more than once and use them to subset the data:

subset(Genetics, ID %in% unique(ID[duplicated(ID)]))

Another approach is to count the number of rows for each ID and keep the rows whose ID has a count greater than 1.

This can be done in base R:

subset(Genetics, ave(seq_along(ID), ID, FUN = length) > 1)

with dplyr

library(dplyr)
Genetics %>% group_by(ID) %>% filter(n() > 1)

and data.table

library(data.table)
setDT(Genetics)[, .SD[.N > 1], ID]
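To see that the two base R variants keep the same rows, here is a small sketch on a made-up data frame. The `toy` object and its `gene` column are assumptions for illustration only; just the `ID` column name comes from the question.

```r
# Hypothetical stand-in for the Genetics data (the gene column is made up)
toy <- data.frame(
  ID   = c("p1", "p2", "p1", "p3", "p2"),
  gene = c("A", "B", "C", "D", "E"),
  stringsAsFactors = FALSE
)

# Approach 1: keep rows whose ID appears among the duplicated IDs
by_dup <- subset(toy, ID %in% unique(ID[duplicated(ID)]))

# Approach 2: keep rows whose per-ID row count exceeds 1
by_count <- subset(toy, ave(seq_along(ID), ID, FUN = length) > 1)

# Both keep rows 1, 2, 3 and 5 (every p1 and p2 row) with all columns intact
same_rows <- isTRUE(all.equal(by_dup, by_count))
```

The dplyr and data.table versions should select the same IDs, although the data.table one returns the rows grouped by ID rather than in their original order.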

answered Jan 16 '20 at 13:57

Ronak Shah


Thank you for your interest. ```subset(Genetics, ID %in% unique(ID[duplicated(ID)]))``` worked! thank you so much! Could you please walk me through it? – Wojty Jan 16 '20 at 14:49
@WojciechBanaś yes, just go through it step by step. `duplicated(ID)` returns a logical vector of `TRUE`/`FALSE` values. We subset the IDs that are repeated with `ID[duplicated(ID)]` and then select the `unique` ones from them. We then use those `ID`s in `subset`. – Ronak Shah Jan 16 '20 at 14:59
Thank you for getting back to me! What I don't understand is why we select the unique ones from the duplicated ones. I am sorry, it just doesn't make sense to me. The logical way (for me) would be to use ```subset(Genetics, ID %in% unique(Genetics$ID))``` in order to find the IDs that do not repeat. What am I missing? – Wojty Jan 16 '20 at 15:13
Using `unique` alone would return every ID irrespective of how many times it occurs, i.e. even the IDs that occur only once, whereas with `duplicated` the result contains only the IDs that occur multiple times. – Ronak Shah Jan 16 '20 at 15:53
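
The step-by-step explanation in the comments above can be traced on a small made-up vector. This is only a sketch with hypothetical IDs, meant to show why `unique(ID)` on its own filters nothing:

```r
ids <- c("p1", "p2", "p1", "p3", "p2", "p1")  # hypothetical IDs

repeats <- ids[duplicated(ids)]   # "p1" "p2" "p1": later copies only
dup_ids <- unique(repeats)        # "p1" "p2": each repeated ID once
all_ids <- unique(ids)            # "p1" "p2" "p3": every ID, repeated or not

keep_dup <- ids %in% dup_ids   # TRUE TRUE TRUE FALSE TRUE TRUE: drops "p3"
keep_all <- ids %in% all_ids   # all TRUE: every ID is in its own unique set

# IDs that occur exactly once are the complement of keep_dup
singles <- ids[!keep_dup]      # "p3"
```

Since `%in%` only tests membership, the `unique()` call around `ID[duplicated(ID)]` does not change which rows are kept; it just makes the lookup set shorter.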
