I have some text that I am trying to decode and encode in Python:

import html.parser

original_tweet = "I luv my &lt;3 iphone &amp; you’re awsm 
                 apple.DisplayIsAwesome, sooo happppppy  
                 http://www.apple.com"
tweet = original_tweet.decode("utf8").encode('ascii', 'ignore')

I entered the original tweet on one line in Spyder (Python 3.6).

I get the following message:

AttributeError: 'str' object has no attribute 'decode'

Is there an alternative way to rewrite this code for Python 3.6?

jonrsharpe
cordelia
  • You seem to be confused about what a string in Python represents and what encoding and decoding do. Encoding turns a string into bytes; decoding does the opposite. In that light, your call doesn't make sense, which is why it fails. – Ulrich Eckhardt Mar 10 '18 at 09:39
  • This is the website I am following and am unable to understand what is going on: https://www.analyticsvidhya.com/blog/2014/11/text-data-cleaning-steps-python/ – cordelia Mar 10 '18 at 09:41
  • You cannot use [`str.encode()`](https://docs.python.org/3/library/stdtypes.html#str.encode) and [`bytes.decode()`](https://docs.python.org/3/library/stdtypes.html#bytes.decode) to handle the HTML entities `&lt;` and `&amp;`, if that’s what you’re trying to do. Look into libs like [Parsing HTML with lxml](http://lxml.de/parsing.html#parsing-html) for that (based on you importing an HTML parser). However, your string `original_tweet` isn’t proper HTML, so you may consider fudging that first… – Jens Mar 10 '18 at 09:43
  • @cordelia That website's code does not make any sense. If your `original_tweet` value is a character string already, there's no need to encode or decode it. If it's a byte string (i.e. a `bytes` object), `decode` it once to get a character string. – phihag Mar 10 '18 at 09:44
  • I believe that the code on that website was written for Python 2. There, a regular string (without `u` prefix) is a byte sequence, which can be decoded. – Ulrich Eckhardt Mar 10 '18 at 09:47
  • Thanks a lot for your comments. This has put me on the right track and prevented me from going round in circles. – cordelia Mar 10 '18 at 09:48
  • How do I tweak this for Python 3.6 then? Should I put a `u` in front of the original_tweet string? – cordelia Mar 10 '18 at 09:48
  • @cordelia what are you trying to achieve? Every string in Py3+ is a Unicode string already. – Jens Mar 10 '18 at 09:49
  • I need to transform the data and change the encoding format. – cordelia Mar 10 '18 at 09:50
  • @cordelia Change the encoding format to _what_? – Jens Mar 10 '18 at 09:51
  • I think ASCII is the format. – cordelia Mar 10 '18 at 09:56
  • @cordelia, that’s unlikely to work considering the Unicode emoji in the original string, which cannot be represented in plain [ASCII](http://www.asciitable.com/mobile/). Take a look at [this](https://stackoverflow.com/questions/4299675/python-script-to-convert-from-utf-8-to-ascii) or [this](https://stackoverflow.com/questions/2365411/convert-unicode-to-ascii-without-errors-in-python) question to convert the Unicode string `original_tweet` into a plain ASCII string. – Jens Mar 10 '18 at 09:58
  • Thank you @Jens, but how do I do this in Python 3.6? Everything seems to be done in Python 2.7. – cordelia Mar 10 '18 at 10:01
  • If you can live with some data loss then `original_tweet.encode("utf-8").decode("ascii", errors="ignore")` should work. First, [encode()](https://docs.python.org/3/library/stdtypes.html#str.encode) the string into a sequence of bytes, then [decode()](https://docs.python.org/3/library/stdtypes.html#bytes.decode) those bytes and dismiss possible decode errors. – Jens Mar 10 '18 at 10:06
  • That seems to work, except it does not convert the `&lt;` and `&amp;` the way they are converted on the website. – cordelia Mar 10 '18 at 10:12
  • Do you recommend my using a parser on original_tweet and then applying your encode and decode code to that? – cordelia Mar 10 '18 at 10:13
  • `original_tweet = "I luv my &lt;3 iphone &amp; you’re awsm apple. DisplayIsAwesome, sooo happppppy http://www.apple.com"` `tweet = html.parser.unescape(original_tweet)` `print(tweet)` – cordelia Mar 10 '18 at 10:13
  • How do I avoid losing the ’re in you’re? I have reposted this because I posted it in the Answer section by mistake. – cordelia Mar 10 '18 at 11:05
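
The distinction raised in the comments above (a `str` is encoded into `bytes`, and `bytes` are decoded into a `str`) can be sketched as follows. The sample string and variable names are made up for illustration; the point is only that in Python 3 a `str` has no `decode()` method, which is where the `AttributeError` in the question comes from.

```python
# Hypothetical sample text; "\u00e9" is the character é.
text = "caf\u00e9"

# str -> bytes: encoding picks a byte representation (here UTF-8).
data = text.encode("utf-8")

# bytes -> str: decoding turns the bytes back into characters.
round_trip = data.decode("utf-8")

# In Python 3, only bytes can be decoded and only str can be encoded.
str_has_decode = hasattr(text, "decode")
bytes_has_decode = hasattr(data, "decode")

print(type(text).__name__, type(data).__name__)
print(data)
print(str_has_decode, bytes_has_decode)
```

In Python 2 an unprefixed string literal was a byte string, which is why the tutorial's `.decode("utf8")` call worked there.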

1 Answer

In Python 3, your original_tweet string is a Unicode string (a sequence of code points) that contains an emoji. Because Unicode, with more than a million possible code points, is a superset of the 128 ASCII characters, you cannot convert every Unicode string into an ASCII string without losing information.

However, if you can live with some data loss (i.e. drop the emoji) then you can try the following (see this or this related question):

original_tweet = "I luv my &lt;3 iphone &amp; you’re awsm ..."

# Encode the string into a sequence of bytes using UTF-8.
original_tweet_bytes = original_tweet.encode("utf-8")

# Decode those bytes as ASCII, silently dropping every byte that is not
# valid ASCII. Pass errors="strict" to find the failing characters instead;
# it is also worth reading up on the option errors="replace".
original_tweet_ascii = original_tweet_bytes.decode("ascii", errors="ignore")

Or as a simple one-liner:

tweet = original_tweet.encode("utf-8").decode("ascii", errors="ignore")
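
To make the difference between the `errors` options concrete, here is a small sketch. The sample string is an assumption chosen for illustration (it is not the full tweet); it contains a single non-ASCII character, U+2019, which takes three bytes in UTF-8.

```python
# Hypothetical sample; "\u2019" is the right single quotation mark.
text = "you\u2019re happy"
utf8_bytes = text.encode("utf-8")

# errors="ignore": bytes that are not valid ASCII are dropped silently.
ignored = utf8_bytes.decode("ascii", errors="ignore")

# errors="replace": every offending byte becomes U+FFFD, so one
# three-byte character turns into three replacement characters.
replaced = utf8_bytes.decode("ascii", errors="replace")

# errors="strict" (the default): the first offending byte raises an
# exception that reports where the problem is.
try:
    utf8_bytes.decode("ascii", errors="strict")
except UnicodeDecodeError as exc:
    failing_position = exc.start

print(ignored)
print(replaced)
print(failing_position)
```

With `"ignore"` the apostrophe simply disappears, which is the data loss mentioned above.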

Note that this does not convert the HTML entities &lt; and &amp; which you may have to address separately. You can do that using a proper HTML parser (e.g. lxml), or use a simple string replacement:

tweet = tweet.replace("&lt;", "<").replace("&amp;", "&")

Or, as of Python 3.4, you can use html.unescape() like so:

import html

tweet = html.unescape(tweet)

See also this question on how to handle HTML entities in strings.
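
One thing that may be worth keeping in mind is the order of the two steps. For the tweet in the question it makes no difference, because `&lt;` and `&amp;` decode to ASCII characters, but other entities can decode to non-ASCII characters. The sketch below uses a made-up input string to show this; unescaping first and stripping afterwards is one way to guarantee an ASCII-only result.

```python
import html

# Hypothetical input; &eacute; decodes to the non-ASCII character U+00E9.
raw = "caf&eacute; &lt;3 &amp; more"

# Unescape first, so all entities become real characters ...
unescaped = html.unescape(raw)

# ... then drop whatever cannot be represented in ASCII.
ascii_only = unescaped.encode("ascii", errors="ignore").decode("ascii")

# The other way round, the entity survives the ASCII step and is
# unescaped afterwards, leaving a non-ASCII character in the result.
wrong_order = html.unescape(raw.encode("ascii", errors="ignore").decode("ascii"))

print(ascii_only)
print(wrong_order)
```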

Addendum. The Unidecode package for Python seems to provide useful functionality for this, too, although in its current version it does not handle emojis.

Jens
  • Thank you so much for helping me with this. That truly resolves my query. – cordelia Mar 10 '18 at 10:23
  • How do I avoid losing the 're for the you're? Apologies for bugging you with this but I just noticed it. – cordelia Mar 10 '18 at 10:57
  • @cordelia, the `’` character is Unicode character [U+2019](https://www.fileformat.info/info/unicode/char/2019/index.htm) and has no direct equivalent in ASCII. What you can do, however, is to use [`str.replace()`](https://docs.python.org/3/library/stdtypes.html#str.replace) to replace all `‘` and `’` with ASCII `'` and the double quotation marks `“` and `”` with ASCII `"`. See also this question: [Replacing unicode punctuation with ASCII approximations](https://stackoverflow.com/questions/4808967/replacing-unicode-punctuation-with-ascii-approximations). – Jens Mar 10 '18 at 11:13
  • Thanks for this @Jens – cordelia Mar 10 '18 at 11:19
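
The replacement suggested in the last comments could be sketched like this. It swaps the chained `str.replace()` calls for `str.translate()`, which handles all four quotation marks in one pass; the names `QUOTE_MAP` and `to_ascii` are made up for this example, and the sample text is a shortened, assumed version of the tweet.

```python
# Map curly quotation marks to their closest ASCII equivalents.
QUOTE_MAP = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
})


def to_ascii(text):
    """Approximate curly quotes in ASCII, then drop anything else non-ASCII."""
    text = text.translate(QUOTE_MAP)
    return text.encode("ascii", errors="ignore").decode("ascii")


# Hypothetical sample; "\U0001F642" stands in for an emoji.
cleaned = to_ascii("you\u2019re awsm \U0001F642")
print(cleaned)
```

Doing the replacement before the ASCII step keeps the apostrophe in you're, while the emoji is still dropped.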