Identify elements from df1 in df2, then add column in df2 in those rows that were coincident using R

Question

I have a dataframe with two columns (genome) and a dataframe with one column (list_SSNP).

What I am trying to do is add a third and a fourth column to my Genome dataframe, holding the value "1" for those entries in Genome that appear in list_SSNP and, separately, in list_SCPG.

I am trying to get an output dataframe that looks like this:

Gene_Symbol       CHR        SNP     
A1BG             19q13.43             
PDE1C            12p13.31     1

This is part of the content of Genome and I have included a reproducible example:

Genome <- data.frame(
  Gene_Symbol = c("A1BG", "A1BG-AS1", "A1CF", "A2M", "PDE1C"),
  CHR = c("19q13.43", "19q13.43", "10q11.23", "12p13.31", "12p13.31"),
  stringsAsFactors = FALSE
)
  Gene_Symbol      CHR
1        A1BG 19q13.43
2    A1BG-AS1 19q13.43
3        A1CF 10q11.23
4         A2M 12p13.31
5       PDE1C 12p13.31

And this is part of the content of list_SSNP:

list_SSNP <- c("PDE1C", "IMMP2L", "ZCCHC14", "NOS1AP", "HARBI1")
    Gene_Symbol
1   PDE1C
2   IMMP2L
3   ZCCHC14
4   NOS1AP
5   HARBI1

I am starting with only one of the dataframes (list_SSNP). What I have tried is a loop through the Genome dataframe: for each row i in Genome, if element i of list_SSNP matches element i of Genome, add the number 1 to a third column. But when I execute this code, nothing happens.

Full_genome <- read.table("FULL_GENOME.txt", header=TRUE, sep = "\t", dec = ',', na.strings=c("","NA"), fill=TRUE)
Genome <- Full_genome[,c(2,3)]
names(Genome) <- c("Gene_Symbol", "CHR")

list_SSNP <- as.data.frame(Gene_SSNP$Gene_Symbol)

for (i in 1:dim(Genome)[1]) {
  if(list_SSNP[i] %in% Genome[i,1]){
    Genome[i,3] <- 1 
  }
}

Just to clarify further: I have checked that all the elements from list_SSNP appear in Genome, so the problem is definitely not a lack of matches.

EDIT:

I realize that my example does not specify that the entries in list_SSNP and Genome are unique (no duplicates), and that Genome has about 30k rows while list_SSNP has 49. I just want to add a column to Genome with the number 1 in those rows where the entry exists in both Genome and list_SSNP.

Hi, you should make a better example, read: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610#5963610 — jay.sf, Aug 08 '20 at 10:59
Hi, thanks for the advice, I have updated the question now according to the guidelines in that post, I thought I had provided enough background. — Alejandra_RS, Aug 08 '20 at 11:35

score 1 · Accepted Answer · answered Aug 08 '20 at 12:38

I believe this could help. You can try this code:

# Data
Genome <- data.frame(Gene_Symbol = c("A1BG", "A1BG-AS1", "A1CF", "A2M", "PDE1C"),
                     CHR = c("19q13.43", "19q13.43", "10q11.23", "12p13.31", "12p13.31"),
                     stringsAsFactors = FALSE)
list_SSNP <- c("PDE1C", "IMMP2L", "ZCCHC14", "NOS1AP", "HARBI1")
# Collapse the symbols into one regular expression, with '|' (OR) between them
vecc <- paste0(list_SSNP, collapse = '|')
# Flag rows whose Gene_Symbol matches the pattern (1 = match, 0 = no match)
Genome$SNP <- as.numeric(grepl(pattern = vecc, x = Genome$Gene_Symbol))

Output:

  Gene_Symbol      CHR SNP
1        A1BG 19q13.43   0
2    A1BG-AS1 19q13.43   0
3        A1CF 10q11.23   0
4         A2M 12p13.31   0
5       PDE1C 12p13.31   1
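
A note on what the collapse step produces, and a possible caveat: paste0(..., collapse = '|') joins the vector into a single regular expression, and grepl() then flags every symbol that contains any of the listed names as a substring. With 30k gene symbols that could over-match (an entry "A1BG" in the list would also flag "A1BG-AS1"). If exact matches are what is wanted, %in% may be a safer alternative. The following is only a sketch on the toy data; the column name SNP_exact is made up for illustration.

```r
# Toy data, same as in the answer above
Genome <- data.frame(Gene_Symbol = c("A1BG", "A1BG-AS1", "A1CF", "A2M", "PDE1C"),
                     CHR = c("19q13.43", "19q13.43", "10q11.23", "12p13.31", "12p13.31"),
                     stringsAsFactors = FALSE)
list_SSNP <- c("PDE1C", "IMMP2L", "ZCCHC14", "NOS1AP", "HARBI1")

# collapse = "|" turns the character vector into ONE string
vecc <- paste0(list_SSNP, collapse = "|")
print(vecc)  # "PDE1C|IMMP2L|ZCCHC14|NOS1AP|HARBI1"

# grepl() matches substrings, so a short symbol also hits longer ones
substring_hits <- grepl("A1BG", c("A1BG", "A1BG-AS1"))  # TRUE TRUE

# %in% tests exact membership, one TRUE/FALSE per row of Genome
# (SNP_exact is a hypothetical column name)
Genome$SNP_exact <- as.numeric(Genome$Gene_Symbol %in% list_SSNP)
print(Genome)
```

Because %in% is vectorised over the whole column, no loop over the rows is needed, and the two objects do not have to be the same length.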

I chose this solution because it was the most direct and simple. Thank you so much for this. I will now read about the collapsing (I am quite new in R and in programming in general so I am not sure about what this function does...). Again, thank you so much!! — Alejandra_RS, Aug 08 '20 at 21:52

score 1 · Answer 2 · answered Aug 08 '20 at 13:10

I may be missing something important here, since the problem is formulated quite specifically to its domain. So, when I abstracted it, I may have overlooked an issue with my proposed solution.

However, I understand that list_SSNP can contain a SNP entry multiple times. So first of all, you could create a list of unique SNPs with the count of their occurrences:

library(dplyr)

list_SSNP = data.frame(SNP = c("PDE1C", "IMMP2L", "ZCCHC14", "NOS1AP", "HARBI1"))
unique_SSNP = list_SSNP %>% 
    group_by(SNP) %>% 
    # the summarize() could be replaced by count I guess, but I usually use this for more control
    summarize(count = n())

And now you use a left_join, which keeps every row of Genome and attaches the count wherever Gene_Symbol matches a SNP:

Genome = data.frame(Gene_Symbol = c("A1BG", "A1BG-AS1", "A1CF", "A2M", "PDE1C"),
                    CHR = c("19q13.43", "19q13.43", "10q11.23", "12p13.31", "12p13.31"),
                    stringsAsFactors = FALSE)

Genome_extended = Genome %>% 
    left_join(unique_SSNP, by = c("Gene_Symbol" = "SNP"))

The count column in the extended dataframe would be NA for symbols that are not in the SNP list, and you could fill the NAs with a variety of commands from dplyr, tidyr or even base R.
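
One way the NA-filling step might look is sketched below, using dplyr::coalesce() (tidyr::replace_na() or a base R is.na() assignment would be alternatives). The object name Genome_filled is made up for illustration; everything else follows the toy data above.

```r
library(dplyr)

# Toy data, same as above
Genome <- data.frame(Gene_Symbol = c("A1BG", "A1BG-AS1", "A1CF", "A2M", "PDE1C"),
                     CHR = c("19q13.43", "19q13.43", "10q11.23", "12p13.31", "12p13.31"),
                     stringsAsFactors = FALSE)
list_SSNP <- data.frame(SNP = c("PDE1C", "IMMP2L", "ZCCHC14", "NOS1AP", "HARBI1"),
                        stringsAsFactors = FALSE)

# One row per SNP with the number of times it occurs
unique_SSNP <- list_SSNP %>%
  group_by(SNP) %>%
  summarize(count = n())

# Keep all Genome rows, attach the count, then turn the NAs into 0
# (Genome_filled is a hypothetical name)
Genome_filled <- Genome %>%
  left_join(unique_SSNP, by = c("Gene_Symbol" = "SNP")) %>%
  mutate(count = coalesce(count, 0L))

print(Genome_filled)
```

The 0L keeps the column an integer, matching what n() returns.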

Hi, first of all, thank you for your answer. The list_SSNP doesn't have the same entry multiple times, it is already a list of unique occurrences. I thought that the left join would work only if the dataframes had the same length, that's my bad; the Genome dataframe has 30k entries and the list_SSNP one only 49. Would this work anyways? thanks — Alejandra_RS, Aug 08 '20 at 13:25
yes, using the group_by() + summarize() would generate a df with the SNP names and a count column full of "1"s... and the left join would add the "1" to all matching Genome symbols. So, if you have two combinations of a Gene_symbol with different CHRs, both would get the 1... (and since I think this might be relevant to you: if you want to rename the added column, you can replace the "count" on the LHS in the summarize() with whatever you want. It would be the name of the added column. If the "count" column is supposed to be "SNP", you have to use a different name where I used SNP as name) — Racooneer, Aug 08 '20 at 13:30
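
The renaming described in the comment above could look roughly like this. It is only a sketch under one assumption: the key column in list_SSNP is called Gene_Symbol (as in the question's printout), which frees the name SNP for the added flag column. The object names flags and Genome_flagged are made up for illustration.

```r
library(dplyr)

Genome <- data.frame(Gene_Symbol = c("A1BG", "A1BG-AS1", "A1CF", "A2M", "PDE1C"),
                     CHR = c("19q13.43", "19q13.43", "10q11.23", "12p13.31", "12p13.31"),
                     stringsAsFactors = FALSE)

# Assumption: the key column is named Gene_Symbol, not SNP
list_SSNP <- data.frame(Gene_Symbol = c("PDE1C", "IMMP2L", "ZCCHC14", "NOS1AP", "HARBI1"),
                        stringsAsFactors = FALSE)

# The name on the left-hand side inside summarize() becomes the new column name
flags <- list_SSNP %>%
  group_by(Gene_Symbol) %>%
  summarize(SNP = n())

# The lengths of the two dataframes do not need to match for a left join
Genome_flagged <- left_join(Genome, flags, by = "Gene_Symbol")

print(Genome_flagged)
```

Rows without a match keep NA in the SNP column, which resembles the blank cells in the output the question asks for.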
