How to delete a part of a text in Python

Question

I am pretty new to Python and got stuck on this problem:

There is a txt file like this:

blahh
blah
blah 
...
<start>
 some stuff
</start>
even more blah blah blah

I want to delete all the blah parts before the <start> and after the </start>. (The real data comes from this link. I want to parse the HTML in the page with bs4, so I think I must first delete all the non-HTML parts.)

Can someone please tell me the best way to do this? I appreciate any help!

@A.J.: Please don't suggest parsing HTML with regexes. Read http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags (and linking to a tag is just useless.) — Wooble, Feb 06 '15 at 17:10

score 1 · Accepted Answer · answered Feb 06 '15 at 17:09

Nope, you don't need to delete the irrelevant parts of the file. Let BeautifulSoup parse the complete file as is, then find the tag you need:

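# Python 3: urlopen lives in urllib.request (urllib2 is Python 2 only)
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = 'http://www.sec.gov/Archives/edgar/data/70858/000119312507058027/0001193125-07-058027.txt'
# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(urlopen(url), 'html.parser')
# First <document> tag; the text outside it is simply ignored
print(soup.document)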
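
If you still want only the text between the two markers (say, to save it to a new file), plain string slicing may be enough. This is just a sketch using the sample from the question: it assumes the literal markers `<start>` and `</start>` each appear once, and `extract_between` is a made-up helper name, not part of any library.

```python
def extract_between(text, start_marker="<start>", end_marker="</start>"):
    """Return the text from start_marker through end_marker (inclusive).

    Hypothetical helper: returns None if either marker is missing.
    """
    begin = text.find(start_marker)
    if begin == -1:
        return None
    # Look for the closing marker only after the opening one
    end = text.find(end_marker, begin + len(start_marker))
    if end == -1:
        return None
    return text[begin:end + len(end_marker)]


# Sample shaped like the file in the question
sample = "blahh\nblah\n<start>\n some stuff\n</start>\neven more blah blah blah\n"
trimmed = extract_between(sample)
print(trimmed)
```

This only slices on fixed marker strings; it does not try to understand the markup, so nested or repeated tags are still a job for BeautifulSoup.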
