Use extract and/or separate to isolate variable string from dataframe

Question

I've looked through the following pages on using regex to isolate a string:

Regular expression to extract text between square brackets

What is a non-capturing group? What does (?:) do?

Split data frame string column into multiple columns

I have a dataframe which contains protein/gene identifiers, and in some cases there are two or more of these strings (seperated by a comma) because of multiple matches from a list. In this case the first string is the strongest match and I'm not necessarily interested in keeping the rest.They represent multiple matches from inferred evidence and when they cannot be easily discriminated all of the hits get put into a column. In this case I'm only interested in keeping the first because the group will likely have the same type of annotation (i.e. type of protein, gene ontology, similar function etc) If I split the multiple entries into more rows then it would appear that I have evidence that they exist in my dataset, but at the empirical level I don't.

My dataframe:

                protein
1 sp|P50213|IDH3A_HUMAN
2  sp|Q9BZ95|NSD3_HUMAN
3  sp|Q92616|GCN1_HUMAN
4 sp|Q9NSY1|BMP2K_HUMAN
5  sp|O75643|U520_HUMAN
6 sp|O15357|SHIP2_HUMAN
523 sp|P10599|THIO_HUMAN,sp|THIO_HUMAN|
524 sp|Q96KB5|TOPK_HUMAN
525 sp|P12277|KCRB_HUMAN,sp|P17540|KCRS_HUMAN,sp|P12532|KCRU_HUMAN
526 sp|O00299|CLIC1_HUMAN
527 sp|P25940|CO5A3_HUMAN

The output I am trying to create:

uniprot gene
P50213 IDH3A
Q9BZ95 NSD3
Q92616 GCN1
P12277 KCRB

I'm trying to use extract and separate functions to do this:

extract(df, protein, into = c("uniprot", "gene"), regex = c("sp|(.*?)|"," 
(.*?)_"), remove = FALSE)

results in:

Error: is_string(regex) is not TRUE

trying separate to at least break apart the two in multiple steps:

separate(df, protein, into = c("uniprot", "gene"), sep = "|", remove = 
FALSE)

results in:

Warning message:
Expected 2 pieces. Additional pieces discarded in 528 rows [1, 2, 3, 4, 5, 
6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...]. 
                protein uniprot gene
1 sp|P50213|IDH3A_HUMAN            s
2  sp|Q9BZ95|NSD3_HUMAN            s
3  sp|Q92616|GCN1_HUMAN            s
4 sp|Q9NSY1|BMP2K_HUMAN            s
5  sp|O75643|U520_HUMAN            s
6 sp|O15357|SHIP2_HUMAN            s

What is the best way to use regex in this scenario and are extract or separate the best way to go about this? Any suggestion would be greatly appreciated. Thanks!

Update based on feedback:

df <- structure(list(protein = c("sp|P50213|IDH3A_HUMAN",  "sp|Q9BZ95|NSD3_HUMAN", 
                             "sp|Q92616|GCN1_HUMAN", "sp|Q9NSY1|BMP2K_HUMAN", "sp|O75643|U520_HUMAN", 
                             "sp|O15357|SHIP2_HUMAN", "sp|P10599|THIO_HUMAN,sp|THIO_HUMAN|", 
                             "sp|Q96KB5|TOPK_HUMAN",   "sp|P12277|KCRB_HUMAN,sp|P17540|KCRS_HUMAN,sp|P12532|KCRU_HUMAN", 
                             "sp|O00299|CLIC1_HUMAN")), class = "data.frame", row.names = c("1", 
                                                                                            "2", "3", "4", "5", "6", "523", "524", "525", "526"))
df1 <- separate(df, protein, into = "protein", sep = ",") 
#i'm only interested in the first match, because science

df2 <- extract(df1, protein, into = c("uniprot", "gene"), regex = "sp\\| 
([^|]+)\\|([^_]+)", remove = FALSE) 
#create new columns with uniprot code and gene id, no _HUMAN

#df2
#                 protein uniprot  gene
#1   sp|P50213|IDH3A_HUMAN  P50213 IDH3A
#2    sp|Q9BZ95|NSD3_HUMAN  Q9BZ95  NSD3
#3    sp|Q92616|GCN1_HUMAN  Q92616  GCN1
#4   sp|Q9NSY1|BMP2K_HUMAN  Q9NSY1 BMP2K
#5    sp|O75643|U520_HUMAN  O75643  U520
#6   sp|O15357|SHIP2_HUMAN  O15357 SHIP2
#523  sp|P10599|THIO_HUMAN  P10599  THIO
#524  sp|Q96KB5|TOPK_HUMAN  Q96KB5  TOPK
#525  sp|P12277|KCRB_HUMAN  P12277  KCRB
#526 sp|O00299|CLIC1_HUMAN  O00299 CLIC1

#and the answer using %>% pipes (this is what I aspire to)
df_filtered <- df %>%
separate(protein, into = "protein", sep = ",") %>%
extract(protein, into = c("uniprot", "gene"), regex = "sp\\|([^|]+)\\|([^_]+)") %>%
select(uniprot, gene)

#df_filtered
#    uniprot  gene
#1    P50213 IDH3A
#2    Q9BZ95  NSD3
#3    Q92616  GCN1
#4    Q9NSY1 BMP2K
#5    O75643  U520
#6    O15357 SHIP2
#523  P10599  THIO
#524  Q96KB5  TOPK
#525  P12277  KCRB
#526  O00299 CLIC1

What about lines 523 and 525? Those look like multiple rows combined, how do you want them to be separated? — acylam, Sep 05 '18 at 14:30
They represent multiple matches from inferred evidence and when they cannot be easily discriminated all of the hits get put into a column. In this case I'm only interested in keeping the first because the group will likely have the same type of annotation (i.e. type of protein, gene ontology, similar function etc) If I split the multiple entries into more rows then it would appear that I have evidence that they exist in my dataset, but at the empirical level I don't. — Adam Rabalski, Sep 05 '18 at 14:36
You should add this comment to your question and update your desired output to reflect that — acylam, Sep 05 '18 at 14:37

akrun · Accepted Answer · 2018-09-05T16:28:36.093

We can capture the pattern as a group ((...)) in extract. Here, we match sp at the beginning (^) of the string followed by a | (metacharacter - escaped \\), followed by one or more characters not a | captured as a group, followed by a | and the second set of characters captured

library(tidyverse)
extract(df, protein, into = c("uniprot", "gene"), 
     regex = "^sp\\|([^|]+)\\|([^|]+).*")

If there are multiple instances of 'sp', then separate the rows into long format with separate_rows and then use extract

df %>%
   separate_rows(protein, sep=",") %>%
   extract(protein, into = c("uniprot", "gene"), 
    "^sp\\|([^|]+)\\|([^|]*).*")

There is one instance where there is only two sets of words. To make it working

df %>%
  separate_rows(protein, sep=",") %>% 
  extract(protein, into = "gene", "([^|]*HUMAN)", remove = FALSE) %>% 
  mutate(uniprot = str_extract(protein, "(?<=sp\\|)[^_]+(?=\\|)")) %>%
  select(uniprot, gene)
#   uniprot        gene
#1   P50213 IDH3A_HUMAN
#2   Q9BZ95  NSD3_HUMAN
#3   Q92616  GCN1_HUMAN
#4   Q9NSY1 BMP2K_HUMAN
#5   O75643  U520_HUMAN
#6   O15357 SHIP2_HUMAN
#7   P10599  THIO_HUMAN
#8     <NA>  THIO_HUMAN
#9   Q96KB5  TOPK_HUMAN
#10  P12277  KCRB_HUMAN
#11  P17540  KCRS_HUMAN
#12  P12532  KCRU_HUMAN
#13  O00299 CLIC1_HUMAN

data

df <- structure(list(protein = c("sp|P50213|IDH3A_HUMAN",  "sp|Q9BZ95|NSD3_HUMAN", 
   "sp|Q92616|GCN1_HUMAN", "sp|Q9NSY1|BMP2K_HUMAN", "sp|O75643|U520_HUMAN", 
  "sp|O15357|SHIP2_HUMAN", "sp|P10599|THIO_HUMAN,sp|THIO_HUMAN|", 
  "sp|Q96KB5|TOPK_HUMAN",   "sp|P12277|KCRB_HUMAN,sp|P17540|KCRS_HUMAN,sp|P12532|KCRU_HUMAN", 
  "sp|O00299|CLIC1_HUMAN")), class = "data.frame", row.names = c("1", 
"2", "3", "4", "5", "6", "523", "524", "525", "526"))

right now the extract function isolates `IDH3A_HUMAN` what if if I just want `IDH3A`? I've tried `^sp\\|C[^|]+)\\|([^|]+)\\_` as I want everything leading to the `_` but I get an error saying `Incorrectly nested parentheses in regexp pattern. (U_REGEX_MISMATCHED_PAREN)` same warning when I use `^sp\\|C[^|]+)\\|([^_HUMAN]+)` in trying to say "match everything after the `|` but `_HUMAN`" — Adam Rabalski, Sep 05 '18 at 16:25
OK I think I got it like this `sp\\|([^|]+)\\|([^_]+)` which says take everything but the `_` correct? Is the inferred function of the `[^_]+` mean "everything but not the `_` and stop there" ? My output doesn't include `HUMAN` which is why I ask. Will update my post with my answer soon. — Adam Rabalski, Sep 05 '18 at 17:08
@AdamRabalski It implies `sp` followed by a `|` (metacharacter - escaped `\\`), followed by one or more character that are not a `|` `([^|]+`)` captured as group followed by a `|` and the second captured group of characters that are not `[` — akrun, Sep 06 '18 at 01:25

Use extract and/or separate to isolate variable string from dataframe

1 Answers1

data