
I am trying to remove the text between two delimiters, '<' and '>'. I am reading email content and then writing that content to a .txt file. I get a lot of junk between those two delimiters, as well as blank lines, in my .txt file. How do I get rid of this? Below is what my script reads back from the .txt file:

 First Name</td>

                <td bgcolor='white' style='padding:5px

 !important;'>Austin</td>

                </tr><tr>

                <td bgcolor='#f9f9f9' style='padding:5px !important;'

 valign='top' width=170>Last Name</td>

Below is my current code for reading from the .txt file which strips empty lines:

    # Get file contents
    fd = open('emailtext.txt','r')
    contents = fd.readlines()
    fd.close()

    new_contents = []

    # Get rid of empty lines
    for line in contents:
        # Strip whitespace; nothing is left if the line was just "\n"
        if not line.strip():
            continue
        # We got something, save it
        else:
            new_contents.append(line)

    for element in new_contents:
        print(element)

Here is what is expected:

 First Name     Austin      


 Last Name      Jones       
E_R

4 Answers

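You can use BeautifulSoup:

from bs4 import BeautifulSoup

# Triple quotes, because the markup itself contains single quotes and newlines
markup = """<td bgcolor='#f9f9f9' style='padding:5px !important;'

 valign='top' width=170>Last Name</td>"""
soup = BeautifulSoup(markup, 'html.parser')
print(soup.get_text())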

backtrack

You should consider using a regex and the re.sub function:

import re
# flags must be passed by keyword: the fourth positional argument is count
print(re.sub(r'<.*?>', '', text, flags=re.DOTALL))

That said, the usual advice not to parse HTML with a hand-rolled parser still applies.
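
As a rough illustration of the regex approach on the sample from the question, the sketch below strips the tags and then drops the blank lines left behind. The `strip_tags` helper and the inline `sample` string are assumptions made for the example, and the code is written for Python 3. `flags=re.DOTALL` is what lets `.` match the newlines inside a tag that spans several lines.

```python
import re

# Sample taken from the question; some tags are split across several lines.
sample = """ First Name</td>

                <td bgcolor='white' style='padding:5px

 !important;'>Austin</td>

                </tr><tr>

                <td bgcolor='#f9f9f9' style='padding:5px !important;'

 valign='top' width=170>Last Name</td>
"""

# Non-greedy match from '<' to the next '>', newlines included.
TAG_RE = re.compile(r'<.*?>', flags=re.DOTALL)


def strip_tags(text):
    """Remove everything between '<' and '>' and drop blank lines (hypothetical helper)."""
    no_tags = TAG_RE.sub('', text)
    lines = [line.strip() for line in no_tags.splitlines()]
    return [line for line in lines if line]


for line in strip_tags(sample):
    print(line)
```

This only works as long as no attribute value contains a literal '>', which is one reason a real HTML parser is the safer choice.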

enrico.bacis

You need to assign the result of your line.strip() to a variable and add that to your other content. Otherwise you will just save the unstripped line.

for line in contents:
    line = line.strip()
    if not line:
        continue
    # We got something, save it
    else:
        new_contents.append(line)
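
The reason is that Python strings are immutable: `strip()` returns a new string and leaves the original untouched. A small sketch of this, with variable names made up for the example:

```python
line = "   Austin  \n"

stripped = line.strip()   # strip() returns a new string

print(repr(line))         # the original still carries its whitespace
print(repr(stripped))     # the stripped copy
```
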
MrLeeh

It looks like you are trying to remove all HTML tags from a text. You could do that by hand, but tags can be complex and can span multiple lines.

My advice would be to use BeautifulSoup, which is written specifically to process XML and HTML:

import bs4

# extract content... then
new_content = bs4.BeautifulSoup(content, 'html.parser').text
print(new_content)

The bs4 module has been extensively tested, copes with many corner cases, and greatly reduces the amount of code you have to write.
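
If installing bs4 is not an option, a similar result can be sketched with the standard library's `html.parser.HTMLParser`. This is a swapped-in alternative, not BeautifulSoup itself. The `TextExtractor` class, the `extract_text` helper and the `html` sample (modelled on the markup in the question) are assumptions made for the example, written for Python 3.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text found between tags, ignoring the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for text outside of tags; skip whitespace-only runs.
        text = data.strip()
        if text:
            self.chunks.append(text)


def extract_text(html):
    """Return the non-empty text pieces of an HTML fragment (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks


# Hypothetical sample with tags that span several lines.
html = """<tr>
<td bgcolor='#f9f9f9' style='padding:5px !important;'

 valign='top' width=170>First Name</td>
<td bgcolor='white' style='padding:5px

 !important;'>Austin</td>
</tr>"""

print(extract_text(html))
```
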

Serge Ballesta