Extract characters between two characters R

Question

I have a data frame and I want to extract the tissue name between the './' and the '.v8'. For this df, the result would be a new column containing just 'Thyroid', 'Esophagus_Muscularis' and 'Adipose_Subcutaneous'.

gene<-c("ENSG00000065485.19","ENSG00000079112.9","ENSG00000079112")
tissue<-c("./Thyroid.v8.signif_variant_gene_pairs.txt.gz","./Esophagus_Muscularis.v8.signif_variant_gene_pairs.txt.gz","./Adipose_Subcutaneous.v8.signif_variant_gene_pairs.txt.gz")
df<-data.frame(gene,tissue)

I really struggle with regex and tried:

pattern="/.\(.*)/.v8(.*)"
result <- regmatches(df$tissue,regexec(pattern,df$tissue))

but I get:

Error: '\(' is an unrecognized escape in character string starting ""/.\("

`unlist` them, `unlist(qdapRegex::ex_between(df$tissue, "/", ".v8"))` — Ronak Shah, Sep 30 '19 at 01:14

akrun · Answer 1 · 2019-09-29T23:02:59.917


In an R string the backslash itself must be escaped, so the regex escape \. is written as \\. Here, we use regex lookarounds to match the word (\\w+) that comes after the . (a metacharacter, so it is escaped) and the /, and that is followed by a . (again written as \\.) and 'v8'.

library(stringr)
library(dplyr)
df %>% 
    mutate(new = str_extract(tissue, "(?<=\\.[/])\\w+(?=\\.v8)"))
#             gene                                                     tissue                  new
#1 ENSG00000065485.19              ./Thyroid.v8.signif_variant_gene_pairs.txt.gz              Thyroid
#2  ENSG00000079112.9 ./Esophagus_Muscularis.v8.signif_variant_gene_pairs.txt.gz Esophagus_Muscularis
#3    ENSG00000079112 ./Adipose_Subcutaneous.v8.signif_variant_gene_pairs.txt.gz Adipose_Subcutaneous

The (?<=\\.[/]) is a positive lookbehind that requires the . and the / immediately before the word (\\w+), and (?=\\.v8) is a positive lookahead that requires the . and the string 'v8' immediately after the word. Lookarounds are checked but are not part of the match, so only the word between the two patterns is extracted.
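The attempt in the question fails for the same escaping reason: `\(` is not a valid escape inside an R string, and the pattern also has the slash and the dot the wrong way round. Below is a minimal base R sketch of that `regmatches`/`regexec` approach using a capture group. It assumes tissue names never contain a dot; the `[^.]+` class and the `tissue_name` variable are choices made for this illustration, not part of the answer.

```r
# Sample paths from the question
tissue <- c("./Thyroid.v8.signif_variant_gene_pairs.txt.gz",
            "./Esophagus_Muscularis.v8.signif_variant_gene_pairs.txt.gz",
            "./Adipose_Subcutaneous.v8.signif_variant_gene_pairs.txt.gz")

# "\\." is a literal dot; "([^.]+)" captures everything up to the next dot
# (assumption: the tissue name itself contains no dot)
pattern <- "^\\./([^.]+)\\.v8"

# regexec() returns the full match followed by the capture groups,
# so element 2 of each result is the captured tissue name (NA if no match)
matches <- regmatches(tissue, regexec(pattern, tissue))
tissue_name <- vapply(matches, function(m) m[2], character(1))
tissue_name
```

Because the class is "anything except a dot", this version should also accept names that contain characters outside `\\w`, such as a hyphen.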

edited Sep 29 '19 at 23:02

answered Sep 29 '19 at 22:57

akrun


Wow! That works, thank you. Could you explain the regex? I understand that '\\.' escapes the special character '.' and I think \\w+ is any word or character, but I don't know the rest. – zoe Sep 29 '19 at 23:01

@zoe Thank you. I updated with some description – akrun Sep 29 '19 at 23:03
Hi @akrun, apologies, but I have some rows that have a '-' in the string, e.g. './Cells_EBV-transformed_lymphocytes.v8.signif_variant_gene_pairs.txt.gz', which I think is resulting in 'NA'. Is there a way to catch the '-' as well? – zoe Sep 29 '19 at 23:37
@zoe Would that make the dupe link not a dupe? – akrun Sep 30 '19 at 17:32
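The hyphen question is not answered in the thread. One possible adjustment, offered as a sketch and not as akrun's own fix, is to widen `\\w+` to a character class that also accepts `-`. `\\w` covers letters, digits and the underscore only, which is why a hyphenated name produces `NA`. The two-element `tissue` vector below is made up from the examples in the thread.

```r
library(stringr)

# One plain name and the hyphenated name from the comment
tissue <- c("./Thyroid.v8.signif_variant_gene_pairs.txt.gz",
            "./Cells_EBV-transformed_lymphocytes.v8.signif_variant_gene_pairs.txt.gz")

# [\\w\\-]+ matches word characters and hyphens; the lookarounds are unchanged
str_extract(tissue, "(?<=\\./)[\\w\\-]+(?=\\.v8)")
# [1] "Thyroid"                           "Cells_EBV-transformed_lymphocytes"
```

If other punctuation can appear in the names, a broader class such as `[^.]+` (anything except a dot) may be a safer choice, provided the names never contain a dot.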
