I am trying to save strings that contain emojis to a .txt file, but I always get an error when running the code.

I set the .txt file up to use UTF-8 encoding.

Code:

subject_proper = subject.text.strip()
subject_proper = subject_proper.decode('utf-8')

Error:

subject_proper = subject_proper.decode('utf-8')
AttributeError: 'str' object has no attribute 'decode'

Edit:

If I drop the .decode call, I get the following error:

UnicodeEncodeError: 'UCS-2' codec can't encode characters in position 65-65: Non-BMP character not supported in Tk

Edit 2:

Example text: Christmas treats for the triathletes ⛄

I have scraped the strings from https://milled.com/wiggle-co-uk

This method has worked before, but I don't know why it does not work with this code. I have tried to find the answer elsewhere, but without success.

I hope someone has an idea :)

HansDampf
  • This might just be the difference between Python 2 and Python 3. – Mark Ransom Dec 13 '19 at 19:57
  • Does this answer your question? ['str' object has no attribute 'decode'. Python 3 error?](https://stackoverflow.com/questions/28583565/str-object-has-no-attribute-decode-python-3-error) – Juan C Dec 13 '19 at 19:57
  • Decode works on bytes. `b'some text'.decode('utf-8')` will work but `'some text'.decode('utf-8')` will not. – WGriffing Dec 13 '19 at 19:58
  • Perhaps try `subject_proper.encode('unicode-escape').decode('utf-8')` ? Found this answer which may be relevant: https://stackoverflow.com/questions/32442608/ucs-2-codec-cant-encode-characters-in-position-1050-1050 – WGriffing Dec 13 '19 at 20:03
  • I have given that a go and I get the following error: IndexError: string index out of range – HansDampf Dec 13 '19 at 20:04
  • Please consider adding some of the text you're trying to parse / decode to the question. – Nick Reed Dec 13 '19 at 20:04
  • @WGriffing I have tried that as well. In this case it does work, but there are no more emojis in the document, just the escape codes such as \xa340. – HansDampf Dec 13 '19 at 20:05
  • I figured out what the problem was. The code runs in PyCharm without issues, but not in IDLE. Removing the print output to the console fixed the issue. It is now writing to the .txt file without issues. – HansDampf Dec 13 '19 at 20:13
  • Does this answer your question? ['UCS-2' codec can't encode characters in position 1050-1050](https://stackoverflow.com/questions/32442608/ucs-2-codec-cant-encode-characters-in-position-1050-1050) – snakecharmerb Dec 14 '19 at 13:50
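
The fix reported in the comments (write to the file, skip the console print) can be sketched roughly as below. This is a minimal illustration, not the asker's actual script: the sample string, the file name `subjects_demo.txt`, and the use of a temporary directory are all assumptions made for the example.

```python
import os
import tempfile

# Hypothetical sample text; the last character (U+1F600) lies outside the BMP.
subject_proper = "Christmas treats for the triathletes \u26c4 \U0001F600"

# Hypothetical output path in the system's temporary directory.
path = os.path.join(tempfile.gettempdir(), "subjects_demo.txt")

# An explicit encoding="utf-8" makes the file write independent of the console.
with open(path, "w", encoding="utf-8") as f:
    f.write(subject_proper + "\n")

# Verify by reading the file back rather than printing the raw string
# to a console that may not support every character.
with open(path, encoding="utf-8") as f:
    round_trip = f.read().rstrip("\n")

print(round_trip == subject_proper)
```

The file write itself never touches Tk, so the UCS-2 error reported above would only be expected from a `print()` call in the IDLE shell, not from `f.write()`.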

1 Answer

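You're trying to decode a string that has already been decoded. In Python 3, a str is already Unicode text, and .decode() exists only on bytes, which is why you get the AttributeError. If your file is set to UTF-8 but only contains ASCII characters, the encoding makes no practical difference, since ASCII is a subset of UTF-8.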

Once you have a str, there's no need to decode it. If you drop .decode('utf-8'), the AttributeError should go away.

If the value might sometimes arrive as bytes rather than str, you can wrap the decode call in a try-except block that catches AttributeError and falls back to using the value as it is.
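
A minimal sketch of that try-except idea, assuming a hypothetical helper named `to_text` (the name and the default encoding are illustrative, not part of the original code):

```python
def to_text(value, encoding="utf-8"):
    """Return value as str, decoding it only when it has a .decode() method."""
    try:
        # bytes (and bytearray) objects provide .decode()
        return value.decode(encoding)
    except AttributeError:
        # A Python 3 str has no .decode(), so it is already text.
        return value


# b"\xe2\x9b\x84" is the UTF-8 encoding of U+26C4 (snowman).
from_bytes = to_text(b"treats \xe2\x9b\x84")
from_str = to_text("treats \u26c4")

print(from_bytes == from_str)
```

An explicit `isinstance(value, bytes)` check would work just as well. The try-except form simply mirrors the suggestion above.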

Nick Reed
  • Unfortunately that does not work. I get the following error: UnicodeEncodeError: 'UCS-2' codec can't encode characters in position 65-65: Non-BMP character not supported in Tk – HansDampf Dec 13 '19 at 20:01
  • Possibly consider `subject_proper = subject_proper.encode('unicode-escape').decode('utf-8')`? I'm not sure what characters you're trying to parse, but Python doesn't seem to like them. Consider checking [this question out, too.](https://stackoverflow.com/q/32442608/7431860) – Nick Reed Dec 13 '19 at 20:04
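
For reference, here is a rough sketch of what the `unicode-escape` round trip from the comment above does to a string, next to one possible alternative for console output. The helper `bmp_safe` and the sample string are hypothetical and exist only for this illustration.

```python
def bmp_safe(text, replacement="\ufffd"):
    """Replace characters above U+FFFF so a BMP-only console can display the text."""
    return "".join(ch if ord(ch) <= 0xFFFF else replacement for ch in text)


# Hypothetical sample: U+26C4 is inside the BMP, U+1F600 is outside it.
sample = "treats \u26c4 \U0001F600"

# The round trip turns every non-ASCII character into a literal escape code,
# so the emoji themselves no longer appear in the result.
escaped = sample.encode("unicode-escape").decode("utf-8")

# Alternative for console output only: keep BMP characters, swap out the rest.
printable = bmp_safe(sample)

print(escaped)
```

This matches what was reported earlier in the thread: the escape approach avoids the error, but the saved text contains codes rather than emoji. Applying something like `bmp_safe` only to what is printed, while writing the original string to the file, would keep the emoji in the output file.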