
I have a data frame (df) and I want to extract the tissue name between './' and '.v8'. For this df, the result would be a column containing just 'Thyroid', 'Esophagus_Muscularis', and 'Adipose_Subcutaneous'.

gene<-c("ENSG00000065485.19","ENSG00000079112.9","ENSG00000079112")
tissue<-c("./Thyroid.v8.signif_variant_gene_pairs.txt.gz","./Esophagus_Muscularis.v8.signif_variant_gene_pairs.txt.gz","./Adipose_Subcutaneous.v8.signif_variant_gene_pairs.txt.gz")
df<-data.frame(gene,tissue)

I really struggle with regex and tried:

pattern="/.\(.*)/.v8(.*)"
result <- regmatches(df$tissue,regexec(pattern,df$tissue))

but I get:

Error: '(' is an unrecognized escape in character string starting ""/.("

Tonio Liebrand
zoe

1 Answer


In R string literals, a backslash must itself be escaped (\\). Here we use regex lookarounds to match the word (\\w+) that follows the . (a metacharacter, so it is escaped) and the /, and that is followed by a . (again escaped with \\) and 'v8'.

library(stringr)
library(dplyr)
df %>% 
    mutate(new = str_extract(tissue, "(?<=\\.[/])\\w+(?=\\.v8)"))
#             gene                                                     tissue                  new
#1 ENSG00000065485.19              ./Thyroid.v8.signif_variant_gene_pairs.txt.gz              Thyroid
#2  ENSG00000079112.9 ./Esophagus_Muscularis.v8.signif_variant_gene_pairs.txt.gz Esophagus_Muscularis
#3    ENSG00000079112 ./Adipose_Subcutaneous.v8.signif_variant_gene_pairs.txt.gz Adipose_Subcutaneous

The (?<=\\.[/]) is a positive lookbehind that requires the . and the / immediately before the word (\\w+), and (?=\\.v8) is a positive lookahead that requires the . and the string 'v8' immediately after it. Neither lookaround consumes characters, so the pattern finds a word that has the given text before and after it and extracts only the word.
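For comparison, here is a minimal base R sketch that stays closer to the attempt in the question: the same regmatches/regexec call, with the backslashes doubled and the slash placed after the dot. The pattern and the name `new` are illustrative choices, not part of the original answer.

```r
tissue <- c("./Thyroid.v8.signif_variant_gene_pairs.txt.gz",
            "./Esophagus_Muscularis.v8.signif_variant_gene_pairs.txt.gz",
            "./Adipose_Subcutaneous.v8.signif_variant_gene_pairs.txt.gz")

# Each backslash is doubled so the regex engine receives \. (a literal dot).
# A single "\(" or "\." is rejected by R's string parser before the regex
# is ever compiled, which is what the error in the question reports.
pattern <- "\\./(.*)\\.v8"

m <- regmatches(tissue, regexec(pattern, tissue))

# For each input, element 1 is the full match and element 2 is the first
# capture group; indexing a non-match (character(0)) at 2 yields NA.
new <- vapply(m, `[`, character(1), 2)
new
```

With a capture group the surrounding text is matched and then discarded by indexing, whereas the lookaround version never includes it in the match at all.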

akrun
  • Wow! That works, thank you. Could you explain the regex? I understand the '\\.' is escaping the special character '.' and I think \\w+ is any word or character, but I don't know the rest. – zoe Sep 29 '19 at 23:01
  • @zoe Thank you. I updated with some description – akrun Sep 29 '19 at 23:03
  • Hi @akrun, apologies, but I have some rows that have a '-' in the string, e.g. './Cells_EBV-transformed_lymphocytes.v8.signif_varianiant_gene_pairs.txt.gz', which I think is resulting in 'NA'. Is there a way to catch the '-' also? – zoe Sep 29 '19 at 23:37
  • @zoe Would that make the dupe link not a dupe? – akrun Sep 30 '19 at 17:32
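On the hyphen question raised in the comments: \\w matches only letters, digits, and the underscore, so a name containing '-' cannot be matched in full and the extraction returns NA. One possible adjustment, sketched here in base R, is to replace \\w+ with a character class that also allows the hyphen. The helper `extract_tissue` is a hypothetical name introduced only for this illustration, and the second path uses the 'signif_variant_gene_pairs' suffix from the question's df.

```r
tissue <- c("./Thyroid.v8.signif_variant_gene_pairs.txt.gz",
            "./Cells_EBV-transformed_lymphocytes.v8.signif_variant_gene_pairs.txt.gz")

# Hypothetical helper (not from the answer): returns the match for each
# element of x, or NA where the pattern does not match.
extract_tissue <- function(x, pattern) {
  m <- regexpr(pattern, x, perl = TRUE)  # perl = TRUE enables lookarounds
  out <- rep(NA_character_, length(x))
  hit <- m != -1L                        # regexpr returns -1 for no match
  out[hit] <- regmatches(x, m)           # regmatches drops non-matches
  out
}

# Original pattern: the second element fails because of the '-'.
word_only <- extract_tissue(tissue, "(?<=\\./)\\w+(?=\\.v8)")

# Adjusted pattern: [\\w-] allows word characters or a literal hyphen.
with_hyphen <- extract_tissue(tissue, "(?<=\\./)[\\w-]+(?=\\.v8)")
```

Placing the hyphen last inside the brackets keeps it literal rather than the start of a range. The same character class could presumably be swapped into the str_extract call in the answer, though that is untested here.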