Reading a text file with unicode characters - Python3

Question

I am trying to read a text file that contains Unicode string prefixes (u) and escape sequences (\n, \u) in the text. Here is an example:

(u'B9781437714227000962', u'Definition\u2014Human papillomavirus (HPV)\u2013related proliferation of the vaginal mucosa that leads to extensive, full-thickness loss of maturation of the vaginal epithelium.\n')

How can I remove these Unicode tags using Python 3 on Linux?

It looks like you are in a [pickle](https://stackoverflow.com/a/899199/2226988). — Tom Blodget, Apr 05 '18 at 16:43
Why is my question downvoted? The file doesn't say it's a pickle, and the person who sent it to me didn't say so either. Anyway, I tried unpickling the file, but it gave me an error: _pickle.UnpicklingError: unpickling stack underflow — Bade, Apr 12 '18 at 19:46

score 1 · Accepted Answer · answered Apr 06 '18 at 07:55

To remove the Unicode escape sequences (or, better, to translate them) in Python 3:

a.encode('utf-8').decode('unicode_escape')

The decode step translates the Unicode escape sequences into the corresponding Unicode characters. Unfortunately, the unicode_escape codec only decodes bytes, not strings, so you need to encode the string first and then decode it.
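As a minimal sketch of that round trip, assuming the text read from the file contains literal backslash sequences such as \u2014: the sample string and the helper names `unescape` and `unescape_safe` are illustrative and not from the question. One caveat worth knowing is that `unicode_escape` treats every non-escaped byte as Latin-1, so encoding with UTF-8 first garbles text that already contains non-ASCII characters. Encoding with `ascii` and the `backslashreplace` error handler is one possible way around that.

```python
def unescape(text: str) -> str:
    """Translate literal escape sequences into characters.

    Follows the answer's approach; only safe when `text` is pure ASCII.
    """
    return text.encode("utf-8").decode("unicode_escape")


def unescape_safe(text: str) -> str:
    """Variant (an assumption, not from the answer) that tolerates non-ASCII.

    Non-ASCII characters are first turned into backslash escapes, so the
    unicode_escape codec restores them instead of misreading their bytes.
    """
    return text.encode("ascii", "backslashreplace").decode("unicode_escape")


# Raw string: the backslashes are literal characters, as they would be in the file.
raw = r"Definition\u2014Human papillomavirus (HPV)\u2013related proliferation.\n"
clean = unescape(raw)
print(clean)
```

After this, `clean` holds real dash characters and ends with an actual newline instead of the two-character sequence backslash plus n.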

But as pointed out in the comment on the question, you appear to have a serialized document. Try to deserialize it with the correct tool, and the Unicode unescaping will happen automatically as part of that.
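The sample line looks like the repr of a Python 2 tuple of unicode strings rather than a pickle. If that assumption holds and the file has one such tuple per line, `ast.literal_eval` from the standard library can parse each line, and the escapes are resolved as part of parsing. The helper name `parse_records` and the file name in the usage note are hypothetical.

```python
import ast


def parse_records(lines):
    """Yield the Python object for each non-empty line.

    Assumes each line is a Python literal such as (u'id', u'text').
    ast.literal_eval only evaluates literals, never arbitrary code.
    """
    for line in lines:
        line = line.strip()
        if line:
            yield ast.literal_eval(line)


# Raw string so the backslashes stay literal, as they would be in the file.
sample = [
    r"(u'B9781437714227000962', u'Definition\u2014Human papillomavirus (HPV)\u2013related proliferation.\n')",
]
records = list(parse_records(sample))
doc_id, text = records[0]
print(doc_id)
print(text)
```

With a real file, the same helper could be fed an open file object, for example `with open("data.txt", encoding="utf-8") as f: records = list(parse_records(f))`, where `data.txt` is a placeholder name.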
