I've been trying to format a page fetched from Craigslist (CL) so I can send it to my email, but this is what I get every time, no matter what I try in order to remove the \n and \t:

b'\n\n\n\t\n\t\n\t\n\t\n\t\n\t\n\n\n\n\t\n\n\n\t
\n\t\t\t
\n\t
\n\t\t
\n\t\t\t
\n 0 favorites\n
\n\n\t\t
\n\t\t
∨
\n\t\t
∧
\n\t\t
\n \n
\n
\n\t \tCL wenatchee all personals casual encounters\n
\n
\n\t\t
\n\t
\n
\n\n\t\t
\n\t\t\t
\n\t\n\t\t\n\t\n\n\n\nReply to: 59nv6-4031116628@pers.craigslist.org\n
\n\n\n\t
\n\t\n\t\tflag [?] :\n\t\t\n\t\t\tmiscategorized\n\t\t\n\t\t\tprohibited\n\t\t\n\t\t\tspam\n\t\t\n\t\t\tbest of\n\t\n
\n\n\t\t

Posted: 2013-08-28, 8:23AM PDT
\n
\n\n
\n \n Well... - w4m - 22 (Wenatchee)\n

I have tried strip, replace, and even a regex, but nothing fazes it: the text always arrives in my email completely unchanged.

Here's the code:

try:
    if url.find('http://') == -1:
        url = 'http://wenatchee.craigslist.org' + url
    html = urlopen(url).read()
    html = str(html)
    html = re.sub(r'\s+', ' ', html)
    print(html)
    part2 = MIMEText(html, 'html')
    msg.attach(part2)
    s = smtplib.SMTP('localhost')
    s.sendmail(me, you, msg.as_string())
    s.quit()
except Exception as e:
    print(e)
  • This code doesn't run and your post is practically unformatted. Format your question, and post a [short, self-contained example](http://sscce.org/) that we can copy and paste to reproduce your problem, or else you're unlikely to get any help. – Henry Keiter Aug 28 '13 at 23:22

1 Answer

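Your issue is that, despite all evidence to the contrary, you never actually turned the bytes object returned by read() into the str you're hoping for. In Python 3, calling str() on a bytes object without specifying an encoding does not decode it; it returns the object's printed representation: a string that starts with b' and in which every newline and tab is spelled out as the two literal characters \n and \t. There is no real whitespace left to match, so your attempts (regexes, replace, strip) come to nothing.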
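To see this concretely, here is a minimal sketch. The `raw` value is made up for illustration and simply stands in for whatever `urlopen(url).read()` returned in the question:

```python
import re

# Hypothetical stand-in for the bytes returned by urlopen(url).read()
raw = b"\n\n\t<p>\n\t\t0 favorites\n</p>\n"

# No encoding given, so this is the repr of the bytes object, not a decode
as_repr = str(raw)

print(as_repr[:2])        # the first two characters are b and a quote
print("\n" in as_repr)    # is there any real newline character?
print("\\n" in as_repr)   # is the two-character sequence backslash + n there?

# The only real whitespace is the single space in "0 favorites",
# so collapsing whitespace runs changes nothing.
collapsed = re.sub(r"\s+", " ", as_repr)
print(collapsed == as_repr)
```

The regex is doing its job; it just has nothing to work on, because the "whitespace" in the string is really backslashes followed by letters.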

What you need to do is decode the bytes first.
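As a rough sketch of that step: the helper below is hypothetical (`decode_body` is not part of any library), and the UTF-8 default and the choice of `errors="replace"` are my assumptions, not something the page guarantees.

```python
def decode_body(raw, charset=None):
    """Decode response bytes into str.

    charset is whatever encoding the server declared, if you know it.
    """
    # Assumption: fall back to UTF-8 when no charset is known
    encoding = charset or "utf-8"
    # errors="replace" substitutes U+FFFD for undecodable bytes instead of raising
    return raw.decode(encoding, errors="replace")


# Made-up sample bytes standing in for a fetched page
raw = b"\n\tReply to: someone@example.org\n"
text = decode_body(raw)
print(repr(text))
```

If you keep the response object around instead of calling `.read()` on it immediately, `response.headers.get_content_charset()` should give you the declared charset (or `None`), which you could pass in as `charset`.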

Personally, my favorite way to clean up whitespace is str.split followed by str.join. Here's a working example: it collapses every run of whitespace, of any kind, into a single space.

from urllib.request import urlopen

try:
    html = urlopen('http://wenatchee.craigslist.org').read()
    html = html.decode("utf-8")  # Decode the bytes into a useful string
    # Now split the string over all whitespace, then join it together again.
    html = ' '.join(html.split())
    print(html)
except Exception as e:
    print(e)
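Plugged back into the email part of the question, the same idea might look like the sketch below. The names `clean_html` and `build_html_part` are hypothetical, the sample bytes are made up so it runs without a network connection, and the actual sending is left as a comment because it depends on your local SMTP setup.

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def clean_html(raw, encoding="utf-8"):
    """Decode the bytes, then collapse every whitespace run to one space."""
    return " ".join(raw.decode(encoding).split())


def build_html_part(raw, encoding="utf-8"):
    """Wrap the cleaned page in a text/html MIME part."""
    return MIMEText(clean_html(raw, encoding), "html")


# Made-up bytes standing in for urlopen(url).read()
raw = b"\n\n\t<p>\n\t\tPosted: 2013-08-28\n</p>\n"

msg = MIMEMultipart("alternative")
part = build_html_part(raw)
msg.attach(part)
print(clean_html(raw))

# Sending would then go as in the question, assuming a local SMTP server:
#   s = smtplib.SMTP('localhost')
#   s.sendmail(me, you, msg.as_string())
#   s.quit()
```

Keeping the cleanup in its own function makes it easy to check the string before it goes anywhere near the mail code.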
Henry Keiter