How to extract substring with Regex in R

Question

I have the following string:

x <- "\n\t\t\t\t\t\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\n\n\t\t\t\t\t\n\t\t\tGEO Publications\n\t\t\t\t\tHandout\n\t\t\t\t\t\tNAR 2013 (latest)\n\t\t\t\t\t\tNAR 2002 (original)\n\t\t\t\t\t\tAll publications\n\t\t\t\t\t\n\t\t\t\tFAQ\n\t\t\t\tMIAME\n\t\t\t\tEmail GEO\n\t\t\t\n                    \n                \n                    \n                    \n                \n                    \n                           NCBI > GEO > Accession Display\nNot logged in | Login\n\n                    \n                \n                    \n                    \n                \n                    \n                        \n                                    \n\n \n \n\nGEO help: Mouse over screen elements for information.\n\nScope: SelfPlatformSamplesSeriesFamily\n  Format: HTMLSOFTMINiML\n  Amount: BriefQuick\n GEO accession:   \n\n\n\n    Sample GSM935277\n\nQuery DataSets for GSM935277\nStatus\nPublic on May 22, 2012\nTitle\nStanford_ChipSeq_GM12878_TBP_IgG-mus\nSample type\nSRA\n \n\nSource name\nGM12878\nOrganism\nHomo sapiens\nCharacteristics\nlab: Stanfordlab description: Snyder - Stanford Universitydatatype: ChipSeqdatatype description: Chromatin IP Sequencingcell: GM12878cell organism: humancell description: B-lymphocyte, lymphoblastoid, International HapMap Project - CEPH/Utah - European Caucasion, Epstein-Barr Viruscell karyotype: normalcell lineage: mesodermcell sex: Ftreatment: Nonetreatment description: No special treatment or protocol appliesantibody: TBPantibody antibodydescription: Mouse monoclonal. Immunogen is synthetic peptide conjugated to KLH derived from within residues 1 - 100 of HumanTATA binding protein TBP. Antibody Target: TBPantibody targetdescription: General transcription factor that functions at the core of the DNA-binding multiprotein factor TFIID. 
Binding of TFIID to the TATA box is the initial transcriptional step of the pre-initiation complex (PIC), playing a role in the activation of eukaryotic genes transcribed by RNA polymerase II."

What I want to do is detect a pattern of this form:

Antibody Target: TBPantibody

and return the substring TBPantibody.

I tried this regex, but it doesn't work:

sub("Antibody Target: ([A-Zaz]+)\\W+", "\\1", x)

What's the right way to do it?

You are aware that deleting a question (a different one) silently, when the answer has been given could cause bad feelings in people who spent their time on helping you, aren't you? — Yunnosch, Jan 25 '19 at 00:37

score 2 · Accepted Answer · answered Jan 23 '19 at 02:33

You could do

sub(".*Antibody Target: ([A-Za-z]+).*", "\\1", x)
#[1] "TBPantibody"
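
To illustrate why the attempt in the question returns the input unchanged and why the surrounding .* matters, here is a minimal sketch. The variable s is a shortened, made-up stand-in for x (an assumption for illustration, not the real data), and the names attempt, partial and whole are hypothetical.

```r
# Shortened, made-up stand-in for x (the real string is much longer).
s <- "Mouse monoclonal. Antibody Target: TBPantibody targetdescription: General"

# The attempt from the question: [A-Zaz] allows only uppercase letters plus
# a literal "a" and "z", so no match is found and s comes back unchanged.
attempt <- sub("Antibody Target: ([A-Zaz]+)\\W+", "\\1", s)

# Correct character class, but no surrounding .*: only the matched span is
# replaced, so the text before and after it survives.
partial <- sub("Antibody Target: ([A-Za-z]+)", "\\1", s)

# With .* on both sides the match covers the whole string, so the result
# is just the captured group.
whole <- sub(".*Antibody Target: ([A-Za-z]+).*", "\\1", s)

print(identical(attempt, s))
print(partial)
print(whole)
```

On the full x the leading and trailing .* also have to cross newlines. As far as I know, the default regex engine in R lets . match a newline, whereas perl = TRUE does not unless the pattern starts with (?s), so the accepted answer should be used without perl = TRUE or adapted accordingly.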

Ronak Shah

RavinderSingh13 · Answer 2 · 2019-01-23T03:30:26.603

Could you please try the following:

sub("(.*Antibody Target: )([^ ]*).*", "\\2", variable)

Explanation: the OP's sample value is assumed to be stored in a variable named variable. This uses base R's sub function, which substitutes the first match of a regex.

sub's syntax:

sub("regex_to_match", "replacement_text_or_backreference_to_a_captured_group", variable_to_work_on)

"(.*Antibody Target: )([^ ]*).*": The first group (...) matches from the start of the value through the string Antibody Target: and captures it. The second group (...) captures everything up to the first space. The trailing .* consumes the rest of the value so that the match covers the whole string; without it, the text after the captured token would be left in the result. "\\2" then replaces the whole value with the second captured group, which is the token that follows Antibody Target: . Note that \\2 must be quoted, because the replacement argument is a character string.

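The group numbering described in this answer can be sketched on a short input. The string s below is hypothetical (not taken from the question) and deliberately contains a hyphen, to show how [^ ]* differs from the letters-only class [A-Za-z]+ used in the accepted answer; the variable names are assumptions as well.

```r
# Hypothetical short input; not taken from the question.
s <- "prefix Antibody Target: TBP-1 rest of text"

pattern <- "(.*Antibody Target: )([^ ]*).*"

# "\\1" refers to the first group: everything up to and including the label.
first_group <- sub(pattern, "\\1", s)

# "\\2" refers to the second group: the token after the label,
# up to (not including) the first space.
second_group <- sub(pattern, "\\2", s)

# A letters-only class stops at the hyphen instead.
letters_only <- sub(".*Antibody Target: ([A-Za-z]+).*", "\\1", s)

print(first_group)
print(second_group)
print(letters_only)
```

Which class is preferable depends on what the target names can contain: [^ ]* keeps digits and punctuation, while [A-Za-z]+ keeps only ASCII letters.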