
I need to extract the text between

</cons> and <cons

every time it occurs in the sentences of a text file, using Notepad++. My sample data looks like this:

<abstract>
<sentence>The <cons lex="CD4_coreceptor" sem="G#protein_molecule">CD4 coreceptor</cons> interacts with <cons lex="non-polymorphic_region" sem="G#protein_domain_or_region">non-polymorphic regions</cons> of <cons lex="major_histocompatibility_complex_class_II_molecule" sem="G#protein_family_or_group">major histocompatibility complex class II molecules</cons> on <cons lex="antigen-presenting_cell" sem="G#cell_type">antigen-presenting cells</cons> and contributes to <cons lex="T_cell_activation" sem="G#other_name">T cell activation</cons>.</sentence>
<sentence>We have investigated the effect of <cons lex="CD4_triggering" sem="G#other_name"><cons lex="CD4" sem="G#protein_molecule">CD4</cons> triggering</cons> on <cons lex="T_cell_activating_signal" sem="G#other_name">T cell activating signals</cons> in a <cons lex="lymphoma_model" sem="G#other_name">lymphoma model</cons> using <cons lex="monoclonal_antibody" sem="G#protein_family_or_group">monoclonal antibodies</cons> (<cons lex="mAb" sem="G#protein_domain_or_region">mAb</cons>) which recognize different <cons lex="CD4_epitope" sem="G#protein_family_or_group">CD4 epitopes</cons>.</sentence>
<sentence>We demonstrate that <cons lex="CD4_triggering" sem="G#other_name"><cons lex="CD4" sem="G#protein_molecule">CD4</cons> triggering</cons> delivers signals capable of activating the <cons lex="NF-AT_transcription_factor" sem="G#protein_molecule">NF-AT transcription factor</cons> which is required for <cons lex="interleukin-2_gene_expression" sem="G#other_name"><cons lex="interleukin-2" sem="G#protein_molecule">interleukin-2</cons> gene expression</cons>.</sentence>
<sentence>Whereas different <cons lex="anti-CD4_mAb" sem="G#protein_family_or_group">anti-CD4 mAb</cons> or <cons lex="HIV-1_gp120" sem="G#protein_molecule"><cons lex="HIV-1" sem="G#virus">HIV-1</cons> gp120</cons> could all trigger activation of the <cons lex="protein_tyrosine_kinase" sem="G#protein_family_or_group">protein tyrosine kinases</cons> <cons lex="p56lck" sem="G#protein_molecule">p56lck</cons> and <cons lex="p59fyn" sem="G#protein_molecule">p59fyn</cons> and phosphorylation of the <cons lex="Shc_adaptor_protein" sem="G#protein_molecule">Shc adaptor protein</cons>, which mediates signals to <cons lex="Ras" sem="G#protein_family_or_group">Ras</cons>, they differed significantly in their ability to activate <cons lex="NF-AT" sem="G#protein_molecule">NF-AT</cons>.</sentence>
<sentence>Lack of full activation of <cons lex="NF-AT" sem="G#protein_molecule">NF-AT</cons> could be correlated to a dramatically reduced capacity to induce <cons lex="calcium_flux" sem="G#other_name"><cons lex="calcium" sem="G#atom">calcium</cons> flux</cons> and could be complemented with a <cons lex="calcium_ionophore" sem="G#other_organic_compound">calcium ionophore</cons>.</sentence>
<sentence>The results identify functionally distinct <cons lex="epitope" sem="G#protein_family_or_group">epitopes</cons> on the <cons lex="CD4_coreceptor" sem="G#protein_molecule">CD4 coreceptor</cons> involved in activation of the <cons lex="Ras/protein_kinase_C_and_calcium_pathway" sem="G#other_name"><cons lex="Ras/protein_kinase_C" sem="G#protein_molecule"><cons lex="Ras/protein_kinase_C_pathway" sem="G#other_name"><cons lex="Ras" sem="G#protein_molecule">Ras</cons><cons lex="protein_kinase_C" sem="G#protein_molecule">/protein kinase C</cons></cons></cons> and <cons lex="calcium_pathway" sem="G#other_name">calcium pathways</cons></cons>.</sentence>
 </abstract>

My desired output is:

interacts with 
of 
on 
and contributes to
on 
in 
using 
which recognize different 
triggering
delivers signals capable of activating the
which is required for 
or 
could all trigger activation of the 
and

I tried the regex

 .*<\/cons>(.*?)<cons.*  and replaced with $1

which only gives me the text from the last occurrence of

</cons> and <cons

in each sentence, while my sentences contain more than one occurrence of these tags. Can anyone help?

Alexey Gorozhanov
Shaheen Gul

3 Answers

  1. Go to Search --> Replace in Notepad++.
  2. Set the search mode to Regular expression.
  3. In the Find what field enter the regular expression "<[^>]+>", in the Replace with field enter a space, and click Replace All.

This replaces all XML tags with a space (you can also put a newline character in the Replace with field).

It will leave you with this string:

The CD4 coreceptor interacts with non-polymorphic regions of major histocompatibility complex class II molecules on antigen-presenting cells and contributes to T cell activation . We have investigated the effect of CD4 triggering on T cell activating signals in a lymphoma model using monoclonal antibodies ( mAb ) which recognize different CD4 epitopes . We demonstrate that CD4 triggering delivers signals capable of activating the NF-AT transcription factor which is required for interleukin-2 gene expression . Whereas different anti-CD4 mAb or HIV-1 gp120 could all trigger activation of the protein tyrosine kinases p56lck and p59fyn and phosphorylation of the Shc adaptor protein , which mediates signals to Ras , they differed significantly in their ability to activate NF-AT . Lack of full activation of NF-AT could be correlated to a dramatically reduced capacity to induce calcium flux and could be complemented with a calcium ionophore . The results identify functionally distinct epitopes on the CD4 coreceptor involved in activation of the Ras /protein kinase C and calcium pathways .
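The same tag-stripping replacement can be tried outside Notepad++. The following is a minimal Python sketch, assuming a shortened sample line; the function name strip_tags is only illustrative. Note that this keeps all of the text, including the text inside the cons elements, not only the text between them.

```python
import re

# Shortened sample line (an assumption for illustration, not the full data)
sample = ('<sentence>The <cons lex="CD4_coreceptor" sem="G#protein_molecule">'
          'CD4 coreceptor</cons> interacts with '
          '<cons lex="T_cell_activation" sem="G#other_name">T cell activation'
          '</cons>.</sentence>')

def strip_tags(text):
    # Replace every tag with a space, as in the Notepad++ steps
    no_tags = re.sub(r"<[^>]+>", " ", text)
    # Collapse the runs of whitespace left behind by adjacent tags
    return re.sub(r"\s+", " ", no_tags).strip()

print(strip_tags(sample))
```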

I hope it helps.


Using a regex to parse XML is difficult; it is better to use an XML parser. The following Python SAX content handler tracks when a </cons> end tag has just been parsed (self.state = 1), whether it is immediately followed by text content (self.state = 2), and whether that text is in turn immediately followed by a cons start element. If so, it prints the content:

import xml.sax

data = b'''\
<abstract>
<sentence>The <cons lex="CD4_coreceptor" sem="G#protein_molecule">CD4 coreceptor</cons> interacts with <cons lex="non-polymorphic_region" sem="G#protein_domain_or_region">non-polymorphic regions</cons> of <cons lex="major_histocompatibility_complex_class_II_molecule" sem="G#protein_family_or_group">major histocompatibility complex class II molecules</cons> on <cons lex="antigen-presenting_cell" sem="G#cell_type">antigen-presenting cells</cons> and contributes to <cons lex="T_cell_activation" sem="G#other_name">T cell activation</cons>.</sentence>
<sentence>We have investigated the effect of <cons lex="CD4_triggering" sem="G#other_name"><cons lex="CD4" sem="G#protein_molecule">CD4</cons> triggering</cons> on <cons lex="T_cell_activating_signal" sem="G#other_name">T cell activating signals</cons> in a <cons lex="lymphoma_model" sem="G#other_name">lymphoma model</cons> using <cons lex="monoclonal_antibody" sem="G#protein_family_or_group">monoclonal antibodies</cons> (<cons lex="mAb" sem="G#protein_domain_or_region">mAb</cons>) which recognize different <cons lex="CD4_epitope" sem="G#protein_family_or_group">CD4 epitopes</cons>.</sentence>
<sentence>We demonstrate that <cons lex="CD4_triggering" sem="G#other_name"><cons lex="CD4" sem="G#protein_molecule">CD4</cons> triggering</cons> delivers signals capable of activating the <cons lex="NF-AT_transcription_factor" sem="G#protein_molecule">NF-AT transcription factor</cons> which is required for <cons lex="interleukin-2_gene_expression" sem="G#other_name"><cons lex="interleukin-2" sem="G#protein_molecule">interleukin-2</cons> gene expression</cons>.</sentence>
<sentence>Whereas different <cons lex="anti-CD4_mAb" sem="G#protein_family_or_group">anti-CD4 mAb</cons> or <cons lex="HIV-1_gp120" sem="G#protein_molecule"><cons lex="HIV-1" sem="G#virus">HIV-1</cons> gp120</cons> could all trigger activation of the <cons lex="protein_tyrosine_kinase" sem="G#protein_family_or_group">protein tyrosine kinases</cons> <cons lex="p56lck" sem="G#protein_molecule">p56lck</cons> and <cons lex="p59fyn" sem="G#protein_molecule">p59fyn</cons> and phosphorylation of the <cons lex="Shc_adaptor_protein" sem="G#protein_molecule">Shc adaptor protein</cons>, which mediates signals to <cons lex="Ras" sem="G#protein_family_or_group">Ras</cons>, they differed significantly in their ability to activate <cons lex="NF-AT" sem="G#protein_molecule">NF-AT</cons>.</sentence>
<sentence>Lack of full activation of <cons lex="NF-AT" sem="G#protein_molecule">NF-AT</cons> could be correlated to a dramatically reduced capacity to induce <cons lex="calcium_flux" sem="G#other_name"><cons lex="calcium" sem="G#atom">calcium</cons> flux</cons> and could be complemented with a <cons lex="calcium_ionophore" sem="G#other_organic_compound">calcium ionophore</cons>.</sentence>
<sentence>The results identify functionally distinct <cons lex="epitope" sem="G#protein_family_or_group">epitopes</cons> on the <cons lex="CD4_coreceptor" sem="G#protein_molecule">CD4 coreceptor</cons> involved in activation of the <cons lex="Ras/protein_kinase_C_and_calcium_pathway" sem="G#other_name"><cons lex="Ras/protein_kinase_C" sem="G#protein_molecule"><cons lex="Ras/protein_kinase_C_pathway" sem="G#other_name"><cons lex="Ras" sem="G#protein_molecule">Ras</cons><cons lex="protein_kinase_C" sem="G#protein_molecule">/protein kinase C</cons></cons></cons> and <cons lex="calcium_pathway" sem="G#other_name">calcium pathways</cons></cons>.</sentence>
 </abstract>'''

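class Handler(xml.sax.ContentHandler):
    # state 0: nothing pending
    # state 1: a </cons> end tag was just seen
    # state 2: text has been seen directly after that </cons>

    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.state = 0
        self.content = ''

    def characters(self,content):
        if self.state == 1:
            # First text chunk after </cons>: start collecting.
            self.content = content
            self.state = 2
        elif self.state == 2:
            # SAX may deliver one text run in several chunks; keep collecting.
            self.content += content
        else:
            self.state = 0

    def startElement(self,name,attr):
        # Text sandwiched between </cons> and a cons start tag: print it.
        if name == 'cons' and self.state == 2:
            print(self.content)
        self.state = 0

    def endElement(self,name):
        if name == 'cons':
            self.state = 1
        else:
            self.state = 0

xml.sax.parseString(data,Handler())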

Output:

 interacts with 
 of 
 on 
 and contributes to 
 on 
 in a 
 using 
 (
) which recognize different 
 delivers signals capable of activating the 
 which is required for 
 or 
 could all trigger activation of the 

 and 
 and phosphorylation of the 
, which mediates signals to 
, they differed significantly in their ability to activate 
 could be correlated to a dramatically reduced capacity to induce 
 and could be complemented with a 
 on the 
 involved in activation of the 
 and 

Here's the best I could do with a regex in Notepad++. It handles all but text after the last replacement:

[screenshot of the Notepad++ Replace dialog]

Output:

 interacts with  of  on  and contributes to  on  in a  using  () which recognize different  delivers signals capable of activating the  which is required for  or  could all trigger activation of the   and  and phosphorylation of the , which mediates signals to , they differed significantly in their ability to activate  could be correlated to a dramatically reduced capacity to induce  and could be complemented with a  on the  involved in activation of the  and  lex="calcium_pathway" sem="G#other_name">calcium pathways</cons></cons>.</sentence>
 </abstract>
Mark Tolonen
  • I encountered an error: TypeError: super() takes at least 1 argument (0 given) – Shaheen Gul Apr 12 '15 at 17:38
  • I changed the code to work in both Python 2 and Python 3. Tested with Python 2.7.9 and 3.3.5. – Mark Tolonen Apr 12 '15 at 17:57
  • Wow, it worked, thanks a lot! One more thing: if I want to do this with file handling, that is, read the whole file as input and also store the result in a file, what changes would I have to make to the code above? – Shaheen Gul Apr 12 '15 at 18:04
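On the follow-up question about reading from a file and writing the result to a file, one possible adaptation is sketched below. It is not from the original answer: the names BetweenConsHandler and extract_to_file are hypothetical, the sketch targets Python 3, and it assumes the input file is well-formed XML with a single root element. The handler writes to an open file object instead of printing, and xml.sax.parse reads the input directly from a path.

```python
import xml.sax

class BetweenConsHandler(xml.sax.ContentHandler):
    """Hypothetical variant of the handler that writes to a file object."""

    def __init__(self, out):
        xml.sax.ContentHandler.__init__(self)
        self.out = out        # open, writable text file
        self.state = 0        # 0: idle, 1: after </cons>, 2: text after </cons>
        self.content = ''

    def characters(self, content):
        if self.state == 1:
            self.content = content
            self.state = 2
        elif self.state == 2:
            # Text may arrive in several chunks; keep collecting
            self.content += content
        else:
            self.state = 0

    def startElement(self, name, attrs):
        # Text between </cons> and the next cons start tag: write one line
        if name == 'cons' and self.state == 2:
            self.out.write(self.content + '\n')
        self.state = 0

    def endElement(self, name):
        self.state = 1 if name == 'cons' else 0

def extract_to_file(in_path, out_path):
    # Parse the XML at in_path and write each extracted text to out_path
    with open(out_path, 'w', encoding='utf-8') as out:
        xml.sax.parse(in_path, BetweenConsHandler(out))
```

It could then be called as extract_to_file('input.xml', 'output.txt'), where both file names are placeholders.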

A simple way to extract the data described above in Notepad++ is:

search .*?</cons>([^<]*?)<cons
replace \1\r\n
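The original attempt started with a greedy .*, which runs as far right as it can and so leaves only the last </cons> ... <cons pair in each line for the capture group. The capture group used above can be tried in Python as well; this is a minimal sketch with a made-up sample line and an illustrative function name between_cons. With re.findall the leading .*? is not needed, because every non-overlapping match is collected.

```python
import re

# Made-up sample line (an assumption for illustration), including a nested cons
sample = ('The <cons lex="x">A</cons> binds <cons lex="y">B</cons> and '
          '<cons lex="z"><cons lex="w">C</cons> D</cons>.')

def between_cons(text):
    # [^<]* cannot cross another tag, so only text that sits directly
    # between a </cons> end tag and the next <cons start tag is captured
    return re.findall(r"</cons>([^<]*)<cons", text)

for piece in between_cons(sample):
    print(repr(piece))
```

Text that follows an inner </cons> but precedes another </cons>, such as " D" in the sample, is not captured by this pattern.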
Shaheen Gul