How does one strip text between two delimiters including empty lines?

Question

I am trying to remove text between these two delimiters: '<' and '>'. I am reading email content and then writing that content to a .txt file. I get a lot of junk between those two delimiters, as well as blank lines, in my .txt file. How do I get rid of this? Below is what my script reads back from the .txt file:

 First Name</td>

                <td bgcolor='white' style='padding:5px

 !important;'>Austin</td>

                </tr><tr>

                <td bgcolor='#f9f9f9' style='padding:5px !important;'

 valign='top' width=170>Last Name</td>

Below is my current code for reading from the .txt file which strips empty lines:

    # Get file contents
    fd = open('emailtext.txt','r')
    contents = fd.readlines()
    fd.close()

    new_contents = []

    # Get rid of empty lines
    for line in contents:
        # Strip whitespace, should leave nothing if empty line was just "\n"
        if not line.strip():
            continue
        # We got something, save it
        else:
            new_contents.append(line)

    for element in new_contents:
        print element

Here is what is expected:

 First Name     Austin      


 Last Name      Jones

Ditto @Farhan.K, but add a few input/expected/got doohickeys (Technical term) — SIGSTACKFAULT, Nov 29 '16 at 15:22

score 0 · Answer 1 · answered Nov 29 '16 at 15:13


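from bs4 import BeautifulSoup

# Triple quotes let the markup contain single quotes and line breaks.
markup = """<td bgcolor='#f9f9f9' style='padding:5px !important;'

 valign='top' width=170>Last Name</td>"""
soup = BeautifulSoup(markup, 'html.parser')
print(soup.get_text())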

You can use BeautifulSoup.

answered Nov 29 '16 at 15:13

backtrack


score 0 · Answer 2 · answered Nov 29 '16 at 15:17


You should consider using a regex and the re.sub function:

import re
# Pass flags by keyword: the fourth positional argument of re.sub is count.
print(re.sub(r'<.*?>', '', text, flags=re.DOTALL))

That said, the usual advice "do not use a custom parser to parse HTML" still applies.
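As a rough sketch of the regex approach applied to the sample from the question (written for Python 3; the helper name `strip_tags` is made up for illustration), the tags are removed first and the leftover blank lines are dropped afterwards. It assumes no attribute value contains a literal '>', which is one of the cases where a regex falls short of a real parser.

```python
import re

# Sample taken from the question; one tag is split across several lines.
text = """ First Name</td>

                <td bgcolor='white' style='padding:5px

 !important;'>Austin</td>

                </tr><tr>
"""


def strip_tags(raw):
    """Remove every <...> span, even across line breaks, then drop blank lines."""
    # re.DOTALL lets '.' match newlines so multi-line tags are removed too.
    without_tags = re.sub(r'<.*?>', '', raw, flags=re.DOTALL)
    lines = (line.strip() for line in without_tags.splitlines())
    return [line for line in lines if line]


print(strip_tags(text))
```

For this sample the remaining pieces are the two cell texts, "First Name" and "Austin", which can then be joined or padded however the output file needs.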

answered Nov 29 '16 at 15:17

enrico.bacis


score 0 · Answer 3 · answered Nov 29 '16 at 15:19

You need to assign the result of your line.strip() to a variable and add that to your other content. Otherwise you will just save the unstripped line.

for line in contents:

    line = line.strip()

    if not line:
        continue
    # We got something, save it
    else:
        new_contents.append(line)
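The same filter can also be written more compactly. This is only a sketch for Python 3: `sample_lines` is a hypothetical stand-in for what `fd.readlines()` returns, and `non_empty_stripped` is an illustrative name.

```python
# Hypothetical stand-in for the list returned by fd.readlines().
sample_lines = [" First Name\n", "\n", "   \n", "                Austin\n"]


def non_empty_stripped(lines):
    """Return each line with surrounding whitespace removed, skipping blank ones."""
    stripped = (line.strip() for line in lines)
    return [line for line in stripped if line]


print(non_empty_stripped(sample_lines))
```

The key point is unchanged: the stripped value is what gets stored, not the original line.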

score 0 · Answer 4 · answered Nov 29 '16 at 15:19

It looks like you are trying to remove all HTML tags from a text. You could do that by hand, but tags can be complex and can even span multiple lines.

My advice would be to use BeautifulSoup, which is written specifically to process XML and HTML:

import bs4

# extract content... then
new_content = bs4.BeautifulSoup(content, 'html.parser').text
print(new_content)

The bs4 module has been extensively tested, copes with many corner cases, and greatly reduces the amount of code you have to write.
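If installing bs4 is not an option, a similar result can be sketched with the standard library's `html.parser` instead. This swaps in a different tool from the one recommended above and is written for Python 3. The class `TextExtractor`, the function `extract_text`, and the sample markup (modelled on the question, with a made-up second cell) are all illustrative assumptions.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text found between tags and ignores the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text outside a tag; skip whitespace-only runs.
        stripped = data.strip()
        if stripped:
            self.chunks.append(stripped)


def extract_text(html):
    """Feed the markup to the parser and return the non-empty text pieces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks


# Hypothetical sample: a start tag split across lines, as in the question.
sample = """<tr>
<td bgcolor='#f9f9f9' style='padding:5px !important;'

 valign='top' width=170>Last Name</td>
<td bgcolor='white'>Jones</td>
</tr>"""

print(extract_text(sample))
```

Because the parser tokenizes the tags itself, a start tag that wraps over several lines needs no special handling here.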

I will try this out. Thanks for your input. — E_R, Nov 30 '16 at 01:42
