0

I have two dataframes, and I want to do a match and merge. Initially I was using inner_join and coalesce, but realized the match portion wasn't properly matching.

I found an example which seemed to be in the right direction How to merge two data frame based on partial string match with R? . One answer suggested using this code:

idx2 <- sapply(df_mouse_human$Protein.IDs, grep, df_mouse$Protein.IDs) idx1 <- sapply(seq_along(idx2), function(i) rep(i, length(idx2[[i]]))) merged <- cbind(df_mouse_human[unlist(idx1),,drop=F], df_mouse[unlist(idx2),,drop=F])

However it fell short. The issue being is the dataset that I want to use as the pattern match, has strings which are longer than what I want to match to, and thus didn't match anything. Let me show a subset of the data:

dput(droplevels(df_mouse))
structure(list(Protein.IDs = c("Q8CBM2;A2AL85;Q8BSY0", "A2AMH3;A2AMH5;A2AMH4;Q6X893;Q6X893-2;A2AMH8", 
"A2AMW0;P47757-2;A2AMV7;P47757;F6QJN8;F6YHZ8;F7CAZ6", "Q3U8S1;A2APM5;A2APM3;A2APM4;E9QKM8;Q80X37;A2APM1;A2APM2;P15379-2;P15379-3;P15379-6;P15379-11;P15379-5;P15379-10;P15379-9;P15379-4;P15379-8;P15379-7;P15379;P15379-12;P15379-13", 
"A2ASS6;E9Q8N1;E9Q8K5;A2ASS6-2;A2AT70;F7CR78", "A2AUR7;Q9D031;Q01730"
), Replicate = c(2L, 2L, 2L, 2L, 2L, 2L), Ratio.H.L.normalized.01 = c(NaN, 
NaN, NaN, NaN, NaN, NaN), Ratio.H.L.normalized.02 = c(NaN, NaN, 
NaN, NaN, NaN, NaN), Ratio.H.L.normalized.03 = c(NaN, NaN, NaN, 
NaN, NaN, NaN)), .Names = c("Protein.IDs", "Replicate", "Ratio.H.L.normalized.01", 
"Ratio.H.L.normalized.02", "Ratio.H.L.normalized.03"), row.names = 12:17, class = "data.frame")

dput(droplevels(df_mouse_human))
structure(list(Human = c("Q8WZ42", "Q8NF91", "Q9UPN3", "Q96RW7", 
"Q8WXG9", "P20929", "Q5T4S7", "O14686", "Q2LD37", "Q92736"), 
    Protein.IDs = c("A2ASS6", "Q6ZWR6", "Q9QXZ0", "D3YXG0", "Q8VHN7", 
    "E9Q1W3", "A2AN08", "Q6PDK2", "A2AAE1", "E9Q401")), .Names = c("Human", 
"Protein.IDs"), row.names = c(NA, 10L), class = "data.frame")

So I want to match the Protein.IDs in df_mouse to where they exist in df_mouse_human. In the sample data I'm trying to match A2ASS6;E9Q8N1;E9Q8K5;A2ASS6-2;A2AT70;F7CR78 to the entry A2ASS6. It works well if I do it the other way, but is there a way so that if part of the pattern matches the query, it will come back TRUE?

My long term goal is to match and merge the data, so that df_mouse gets a new column with the matching Human protein ids, and where there is no match I'll just replace the NA value with the original string of mouse IDs.

thanks

Josh O'Brien
  • 159,210
  • 26
  • 366
  • 455
Dave R
  • 202
  • 1
  • 8
  • 1
    You'll likely have multiple matches for each (in your real data), what do you intend for the output? Also, you made a good question with sample data, but there is only one match among everything there, is that representative? That is (pursuant to my first question), if you actually get two `Human` matches in one row, what do you intend to happen? – r2evans Oct 15 '18 at 23:35

2 Answers2

2

One method I commonly use with partial matches like this is to reduce the more-complex field to make it look like the simpler one. Sometimes this involves just removing extraneous characters (e.g., if "match only on the first four chars", then I'd make a new index column from substr(idcol, 1, 4) and join on that), but in this case it involves breaking one string into multiple.

This involves associating each of the semi-colon-delimited ids with the big-string, making this intermediate frame taller (sometimes much taller) than the original data.

(For the sake of presentability/aesthetics, I'm modifying df1 to remove the other invariant columns and, for the sake of "other data", adding a row number column.)

I'm using dplyr and tidyr, so:

library(dplyr)
library(tidyr)
df1 <- select(df1, Protein.IDs) %>%
  mutate(other = row_number())

First I'll break the 6-row frame into a much larger one:

df1ids <- tbl_df(df1) %>%
  select(Protein.IDs) %>%
  mutate(eachID = strsplit(Protein.IDs, ";")) %>%
  unnest()
df1ids
# # A tibble: 46 x 2
#    Protein.IDs                                        eachID  
#    <chr>                                              <chr>   
#  1 Q8CBM2;A2AL85;Q8BSY0                               Q8CBM2  
#  2 Q8CBM2;A2AL85;Q8BSY0                               A2AL85  
#  3 Q8CBM2;A2AL85;Q8BSY0                               Q8BSY0  
#  4 A2AMH3;A2AMH5;A2AMH4;Q6X893;Q6X893-2;A2AMH8        A2AMH3  
#  5 A2AMH3;A2AMH5;A2AMH4;Q6X893;Q6X893-2;A2AMH8        A2AMH5  
#  6 A2AMH3;A2AMH5;A2AMH4;Q6X893;Q6X893-2;A2AMH8        A2AMH4  
#  7 A2AMH3;A2AMH5;A2AMH4;Q6X893;Q6X893-2;A2AMH8        Q6X893  
#  8 A2AMH3;A2AMH5;A2AMH4;Q6X893;Q6X893-2;A2AMH8        Q6X893-2
#  9 A2AMH3;A2AMH5;A2AMH4;Q6X893;Q6X893-2;A2AMH8        A2AMH8  
# 10 A2AMW0;P47757-2;A2AMV7;P47757;F6QJN8;F6YHZ8;F7CAZ6 A2AMW0  
# # ... with 36 more rows

Notice how the first row of three is now three rows of three. We'll use "eachID" to join.

left_join(df1ids, df2, by = c("eachID" = "Protein.IDs")) %>%
  filter(complete.cases(.)) %>%
  select(Human, Protein.IDs) %>%
  right_join(df1)
# Joining, by = "Protein.IDs"
# # A tibble: 6 x 3
#   Human  Protein.IDs                                                  other
#   <chr>  <chr>                                                        <int>
# 1 <NA>   Q8CBM2;A2AL85;Q8BSY0                                             1
# 2 <NA>   A2AMH3;A2AMH5;A2AMH4;Q6X893;Q6X893-2;A2AMH8                      2
# 3 <NA>   A2AMW0;P47757-2;A2AMV7;P47757;F6QJN8;F6YHZ8;F7CAZ6               3
# 4 <NA>   Q3U8S1;A2APM5;A2APM3;A2APM4;E9QKM8;Q80X37;A2APM1;A2APM2;P15~     4
# 5 Q8WZ42 A2ASS6;E9Q8N1;E9Q8K5;A2ASS6-2;A2AT70;F7CR78                      5
# 6 <NA>   A2AUR7;Q9D031;Q01730                                             6

If you happen to have multiple Human rows for each Proteins.IDs, things change a little.

df2$Protein.IDs[2] <- "E9Q8K5"
left_join(df1ids, df2, by = c("eachID" = "Protein.IDs")) %>%
  filter(complete.cases(.)) %>%
  select(Human, Protein.IDs) %>%
  right_join(df1)
# Joining, by = "Protein.IDs"
# # A tibble: 7 x 3
#   Human  Protein.IDs                                                  other
#   <chr>  <chr>                                                        <int>
# 1 <NA>   Q8CBM2;A2AL85;Q8BSY0                                             1
# 2 <NA>   A2AMH3;A2AMH5;A2AMH4;Q6X893;Q6X893-2;A2AMH8                      2
# 3 <NA>   A2AMW0;P47757-2;A2AMV7;P47757;F6QJN8;F6YHZ8;F7CAZ6               3
# 4 <NA>   Q3U8S1;A2APM5;A2APM3;A2APM4;E9QKM8;Q80X37;A2APM1;A2APM2;P15~     4
# 5 Q8WZ42 A2ASS6;E9Q8N1;E9Q8K5;A2ASS6-2;A2AT70;F7CR78                      5
# 6 Q8NF91 A2ASS6;E9Q8N1;E9Q8K5;A2ASS6-2;A2AT70;F7CR78                      5
# 7 <NA>   A2AUR7;Q9D031;Q01730                                             6

Notice how you now have two copies of other 5? Likely not what you want. If you intend to continue with the semi-colon-delimited theme, though:

left_join(df1ids, df2, by = c("eachID" = "Protein.IDs")) %>%
  filter(complete.cases(.)) %>%
  group_by(Protein.IDs) %>%
  summarize(Human = paste(Human, collapse = ";")) %>%
  select(Human, Protein.IDs) %>%
  right_join(df1)
# Joining, by = "Protein.IDs"
# # A tibble: 6 x 3
#   Human       Protein.IDs                                             other
#   <chr>       <chr>                                                   <int>
# 1 <NA>        Q8CBM2;A2AL85;Q8BSY0                                        1
# 2 <NA>        A2AMH3;A2AMH5;A2AMH4;Q6X893;Q6X893-2;A2AMH8                 2
# 3 <NA>        A2AMW0;P47757-2;A2AMV7;P47757;F6QJN8;F6YHZ8;F7CAZ6          3
# 4 <NA>        Q3U8S1;A2APM5;A2APM3;A2APM4;E9QKM8;Q80X37;A2APM1;A2APM~     4
# 5 Q8WZ42;Q8N~ A2ASS6;E9Q8N1;E9Q8K5;A2ASS6-2;A2AT70;F7CR78                 5
# 6 <NA>        A2AUR7;Q9D031;Q01730                                        6
r2evans
  • 141,215
  • 6
  • 77
  • 149
  • Hi, bringing the values back into the semi-colon-delimited theme is ideal. The mouse data only says that the particular peptide sequence could belong to one of x different protein ids, so the return should be the x number of human protein ids it could potentially map to. Thank you for your solution! – Dave R Oct 16 '18 at 19:26
1

@r2evans asks a good question about what to do with multiple matches. Once that question gets answered, I may need to edit my answer, but here is a quick solution. First, we split up the string of possible IDs, then we see which IDs are matched in the other dataframe, then we join on the row index of the match.

library(tidyverse)

df_mouse %>% mutate(all_id = str_split(Protein.IDs, ";"),
                    row = map(all_id, ~.x %in% df_mouse_human$Protein.IDs %>% which())) %>%
  unnest(row) %>%
  list(., df_mouse_human %>% rownames_to_column("row") %>% mutate(row = as.numeric(row))) %>%
  reduce(left_join, by = "row")
#>                                 Protein.IDs.x Replicate
#> 1 A2ASS6;E9Q8N1;E9Q8K5;A2ASS6-2;A2AT70;F7CR78         2
#>   Ratio.H.L.normalized.01 Ratio.H.L.normalized.02 Ratio.H.L.normalized.03
#> 1                     NaN                     NaN                     NaN
#>   row  Human Protein.IDs.y
#> 1   1 Q8WZ42        A2ASS6
AndS.
  • 7,748
  • 2
  • 12
  • 17