removing specific duplicates in R

Question

I have data that looks like this ( sample of it ) :

snp_id=c("chr16-54319851-C-A","chr16-54319851-C-A","chr16-54319851-C-A","chr16-54319851-C-A","chr10-100003732-A-G","")
AF_total=c("-","-",0.1,0.1,"-","-")
df=data.frame(snp_id,AF_total)

It can be seen that there are duplicates. What I wish to do is delete from this data frame the records that have "-" in the AF_total but only if the value in snp_id column is duplicated ( in this case it is) and there are value besides "-" (0.1) and after that delete the extra row with 0.1 value. For the other snp_id value I have duplicate of "-" so i would like to live only one "-" so in the end the data will look like this:

snp id                AF_total
chr16-54319851-C-A       0.1     
chr10-100003732-A-G       -

I looked at some QA like this :Remove duplicates in one column based on another column but sadly they didnt help. would be happy for help , thank you

Gregor Thomas · Accepted Answer · 2023-02-08T22:07:50.233

2

df %>%
  # grouped by snp_id
  group_by(snp_id) %>%
  # keep either (
  #  all the rows if all the AF_total values are "-"
  #  OR
  #  if there are some non-"-" values keep only the non-"-" values
  filter(all(AF_total == "-") | AF_total != "-") %>%
  ungroup() %>%
  # of the rows that remain, de-duplicate
  distinct(snp_id, AF_total, .keep_all = TRUE) %>%
  # drop any rows (like your last sample row) that have a blank snp_id
  filter(snp_id != "")
# # A tibble: 2 × 2
#   snp_id              AF_total
#   <chr>               <chr>   
# 1 chr16-54319851-C-A  0.1     
# 2 chr10-100003732-A-G -

edited Feb 08 '23 at 22:07

answered Feb 08 '23 at 21:15

Gregor Thomas

136,190
20
167
294

thank you :) my original data has more columns then snp_id and AF_tota (30 columns) when i used your code it created data with only snp_id and AF_tota , is there a way to keep all the columns ? – el rom Feb 08 '23 at 21:19
Add `.keep_all = TRUE ` to the `distinct()` command. Editing now... – Gregor Thomas Feb 08 '23 at 21:21
(And I hope you don't care which row is kept in the case of duplicates, `.keep_all` will keep the first.)] – Gregor Thomas Feb 08 '23 at 21:22
@GregorThomas .keep_all in case that we "save" the number here 0.1 will keep the first row with the number 0.1 and not "-" ? would really appreciate if you could provide a simple explanation for your code so i will get the best learning process from it – el rom Feb 08 '23 at 21:30
I added comments. Let me know if you have specific questions. – Gregor Thomas Feb 08 '23 at 22:08

score 0 · Answer 2 · answered Feb 09 '23 at 06:39

You could coerce AF_total as.numeric, which deletes the -, and sort the data first by snp_id then AF_total using order. NAs will have lower rank and you can simply use duplicated.

transform(df, AF_total=as.numeric(AF_total)) |>
  {\(.) .[with(., order(snp_id, AF_total)), ]}() |>
  subset(!(snp_id == '' | duplicated(snp_id)))
#                snp_id AF_total
# 5 chr10-100003732-A-G       NA
# 3  chr16-54319851-C-A      0.1

If the warning annoys you, you can use suppressWarnings().

Data:

df <- structure(list(snp_id = c("chr16-54319851-C-A", "chr16-54319851-C-A", 
"chr16-54319851-C-A", "chr16-54319851-C-A", "chr10-100003732-A-G", 
""), AF_total = c("-", "-", "0.1", "0.1", "-", "-")), class = "data.frame", row.names = c(NA, 
-6L))

removing specific duplicates in R

2 Answers2