Sorry for posting this again. I am getting the error UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 45: ordinal not in range(128) when I run the following code, strip_tags():

from HTMLParser import HTMLParser
class MLStripper(HTMLParser):
    def __init__(self):
        self.reset()
        self.fed = []

    def handle_data(self, d):
        self.fed.append(d)

    def get_data(self):
        return ''.join(self.fed)


def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()

on this string of text:

"<p>We’re implementing the PayPal MECL library in a client’s app but we’re experiencing some poor user experience that we don’t seem to be able to change. \nWhen the PayPal experience is complete, PayPal show a “Please wait while we transfer you to the business site...” message. Obviously this is an iOS app not a “business site”...</p>\n\n<p>The flow functions by dismissing the web view on completion of the PayPal experience by listening for new URL requests within the UIWebViewDelegate method:</p>\n\n<pre><code>- (BOOL)webView:(UIWebView *)webView shouldStartLoadWithRequest:(NSURLRequest *)request navigationType:(UIWebViewNavigationType)navigationType\n</code></pre>\n\n<p>This issue seems to be that PayPal update their web view with the message via editing the DOM (JS or some such) which does not create a new web request and therefor no shouldStartLoadWithRequest fired. Note: A new request is made after a second or so when redirected but that’s too late, the inappropriate copy has been presented to the user.</p>\n\n<p>Has anyone working with MECL on iOS or Android managed to alter this copy/experience either via the <a href=\"https://cms.paypal.com/uk/cgi-bin/?cmd=_render-content&amp;content_ID=developer/e_howto_api_nvp_r_SetExpressCheckout\" rel=\"nofollow\">SetExpressCheckout</a> server call or configuration of the <a href=\"https://cms.paypal.com/uk/cgi-bin/?cmd=_render-content&amp;content_ID=developer/e_howto_api_WPECOnMobileDevices\" rel=\"nofollow\">MECL URL get params</a>?I ’ve been unable to find a resolution on this so far but will post a solution if we find one. Any help would be greatly appreciated as we don’t seem to be able to find a solution in PayPals documentation...</p>\n\n<p><strong>NOTE:</strong> Also we have a similar UX issue when pressing the cancel button on the PayPal web view that causes a redirect, but with a similar bad piece of copy presented before hand “Cancel this purchase and return to the seller’s website?”. 
This is worded as a confirmation dialogue but there are no buttons presented and it redirects anyway. Mad UX. Again if anyone knows a solution to either if these please post.</p>\n\n<p><img src=\"https://i.stack.imgur.com/gc4zq.png\" alt=\"&quot;Please wait while we transfer you to the business site...&quot; image\"></p>\n\n<p><img src=\"https://i.stack.imgur.com/cztum.png\" alt=\"&quot;Cancel this purchase and return to the seller’s website?&quot; image\"></p>\n"

I am processing 6 million documents and so far (10% of the way through) I have hit the above error message. I can fix this for the above piece of text if I call a.decode("utf-8") before calling the strip_tags function, but then the rest of my code stops working.

Any ideas on what I can do? I'm tempted to just use regex to strip the HTML tags (I know that's wrong).

Thank you.

mchangun
  • In what way does it break when decoding to UTF-8? – sdasdadas Oct 28 '13 at 17:46
  • http://stackoverflow.com/questions/57708/convert-xml-html-entities-into-unicode-string-in-python – rlms Oct 28 '13 at 17:46
  • @sdasdadas This is the error message `text = text.translate(None, string.digits) TypeError: translate() takes exactly one argument (2 given)` I'm puzzled - does string.translate take different arguments if it's a utf-8 string? – mchangun Oct 28 '13 at 17:51
  • @mchangun: `unicode.translate()` indeed takes only one argument, a mapping. – Martijn Pieters Oct 28 '13 at 17:53
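
The one-argument form mentioned in the last comment can be sketched roughly as follows. This is only an illustration, assuming the goal of `text.translate(None, string.digits)` was to delete ASCII digits; the names `DELETE_DIGITS` and `remove_digits` are hypothetical and do not come from the question. The mapping goes from code points to `None`, which `translate()` treats as "delete this character" for `unicode` in Python 2 and for `str` in Python 3.

```python
# -*- coding: utf-8 -*-
import string

# Hypothetical helper table: map each digit's code point to None.
# A value of None tells translate() to drop that character.
DELETE_DIGITS = {ord(ch): None for ch in string.digits}


def remove_digits(text):
    """Return `text` (a unicode string) with ASCII digits removed.

    Assumes `text` has already been decoded, e.g. with .decode("utf-8").
    """
    return text.translate(DELETE_DIGITS)


# Non-ASCII characters such as the curly apostrophe pass through untouched.
sample = u"We\u2019re on iOS 7 with 2 web views"
cleaned = remove_digits(sample)
```

The two-argument call `translate(None, string.digits)` only exists on Python 2 byte strings, so once the text is decoded to `unicode`, any later step that uses it would need a mapping like the one above.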

0 Answers