The span that should be deleted lies between the tag <p>Advertisement</p> and the final tag </time> before the article starts. As you can see, the regular expression has to delete text that spans multiple lines. In a similar post it was suggested that I use this regular expression. However, it also deletes the first tag and attribute, which I would like to keep.

import re

text='''
<p>Advertisement</p>, <p class="byline-dateline"><span class="byline"itemprop="author creator" itemscope="" itemtype="http://schema.org/Person">By <span class="byline-author" 
data-byline-name="MILAN SCHREUER" itemprop="name">MILAN SCHREUER</span> and </span><span class="byline" 
itemid="http://topics.nytimes.com/top/reference/timestopics/people/r/alissa_johannsen_rubin/index.html" 
itemprop="author creator" itemscope="" itemtype="http://schema.org/Person"><a href="http://topics.nytimes.com/top/reference/timestopics/people/r/alissa_johannsen_rubin/index.html" 
title="More Articles by ALISSA J. RUBIN"><span class="byline-author" data-byline-name="ALISSA J. RUBIN" data-twitter-handle="Alissanyt" itemprop="name">ALISSA J. RUBIN</span></a></span><time class="dateline" content="2016-10-06T01:02:19-04:00" 
datetime="2016-10-06T01:02:19-04:00" itemprop="dateModified">OCT. 5, 2016</time>
</p>, <p class="story-body-text story-content" data-para-count="163"   data-total-count="163">BRUSSELS — A man wounded two police officers with a knife in Brussels around noon on Wednesday in what the authorities called “a potential terrorist attack.”</p>, <p class="story-body-text story-content" 
data-para-count="231" data-total-count="394">The two officers were attacked on the Boulevard Lambermont in the Schaerbeek district, just north of the city center. A third police officer, who came to their aid, was also    injured. None of the three had life-threatening injuries.</p>
'''
my_pattern = r"[\s\S]+</time>[\s\S]</p>,\s"
results = re.sub(my_pattern, " ", text)
print(results)
  • Do you want to keep the fragments `

    ` and drop everything else?

    – Laurent LAPORTE Nov 08 '16 at 16:25
  • Obligatory link to **[don't parse XML with regex](http://stackoverflow.com/a/1732454/1954610)**. Also, [I can't reproduce the problem](https://regex101.com/r/4Ih2pO/1). – Thomas Ayoub Nov 08 '16 at 16:28
  • Agree (and funny), but there are exceptions… – Laurent LAPORTE Nov 08 '16 at 16:30
  • There are several articles that share the same header. If I use that pattern, it deletes everything but the last article in the file. Yes, I want to keep that fragment and clean it up later. I need a way to delete those headers without deleting all the previous articles. – M.Huntz Nov 09 '16 at 09:53

1 Answer

my_pattern = r"(?<=<p>Advertisement</p>)[\s\S]+</time>[\s\S]</p>,\s"
  • I want to use this pattern on a text file with several articles that all start with that header. I thought that applying the pattern would delete just the header of each article. However, it deletes all the previous articles and leaves only the last article, without its header. – M.Huntz Nov 09 '16 at 10:01
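
The behaviour described in the comment above is what a greedy `[\s\S]+` would be expected to produce: it runs to the last `</time>` in the file, so everything between the first header and the last one is removed. A minimal sketch of one possible workaround is to make the quantifier lazy (`+?`) so each match stops at the nearest `</time>`. The two-article sample text and the names `header` and `cleaned` are hypothetical, and using `\s*` between `</time>` and `</p>` is an assumption about how much whitespace separates them in the real file.

```python
import re

# Hypothetical sample: two articles that share the same header layout.
text = (
    '<p>Advertisement</p>, <p class="byline-dateline"><span class="byline">By A</span>'
    '<time class="dateline">OCT. 5, 2016</time>\n'
    '</p>, <p class="story-body-text">First article.</p>\n'
    '<p>Advertisement</p>, <p class="byline-dateline"><span class="byline">By B</span>\n'
    '<time class="dateline">OCT. 6, 2016</time>\n'
    '</p>, <p class="story-body-text">Second article.</p>\n'
)

# The lookbehind keeps <p>Advertisement</p> in the output.
# The lazy [\s\S]+? stops at the first </time> after each header
# instead of running on to the last one in the file.
header = re.compile(r"(?<=<p>Advertisement</p>)[\s\S]+?</time>\s*</p>,\s")

cleaned = header.sub(" ", text)
print(cleaned)
```

With this sample, each header is removed separately and both article bodies are kept. For real pages an HTML parser is likely to be more robust than a regular expression, as the linked comment suggests.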