
I am pretty new to Python and got stuck on this problem:

There is a txt file like this:

blahh
blah
blah 
...
<start>
 some stuff
</start>
even more blah blah blah

I want to delete all the blah parts before the <start> and after the </start>. (The main thing is coming from this link.) I want to parse the HTML in the page with bs4, so I think I must first delete all the non-HTML parts.

Can someone please tell me what the best way to do this is? I appreciate any help!

alecxe
    @A.J.: Please don't suggest parsing HTML with regexes. Read http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags (and linking to a tag is just useless.) – Wooble Feb 06 '15 at 17:10

1 Answer


Nope, you don't need to delete the irrelevant parts of the file. Let BeautifulSoup parse the complete file as is and find the tag you need:

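from urllib2 import urlopen  # Python 2; on Python 3 use: from urllib.request import urlopen
from bs4 import BeautifulSoup

url = 'http://www.sec.gov/Archives/edgar/data/70858/000119312507058027/0001193125-07-058027.txt'
# Parse the whole file as is; naming the parser explicitly avoids a bs4 warning
soup = BeautifulSoup(urlopen(url), 'html.parser')
# Print the first <DOCUMENT> element (the parser lowercases tag names)
print(soup.document)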
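
To try the same idea offline, here is a minimal sketch that runs against the sample layout from the question instead of the real URL. The sample string and the names `raw`, `start`, and `inner` are made up for illustration, and the choice of `html.parser` is an assumption (another installed parser such as lxml should behave similarly here).

```python
from bs4 import BeautifulSoup

# Hypothetical sample mirroring the file layout described in the question.
raw = """blahh
blah
blah
<start>
 some stuff
</start>
even more blah blah blah
"""

# Parse everything, including the surrounding non-markup text.
soup = BeautifulSoup(raw, "html.parser")

# Look up the tag of interest; the text outside it is simply ignored.
start = soup.find("start")

# Text inside the tag, with surrounding whitespace trimmed.
inner = start.get_text().strip()
print(inner)
```

The text before and after the tag ends up as plain strings in the parse tree, so it does not get in the way of finding the element.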
alecxe