-1

I am trying to read a text file whose contents include the `u` string prefix and escape sequences such as `\n` and `\uXXXX`. Here is an example:

(u'B9781437714227000962', u'Definition\u2014Human papillomavirus (HPV)\u2013related proliferation of the vaginal mucosa that leads to extensive, full-thickness loss of maturation of the vaginal epithelium.\n')

How can I remove these Unicode escapes using Python 3 on Linux?

Bade
  • 2
    It looks like you are in a [pickle](https://stackoverflow.com/a/899199/2226988). – Tom Blodget Apr 05 '18 at 16:43
  • Why is my question downvoted? The file doesn't say it's a pickle, and the person who sent it to me didn't say so either. Anyway, I tried unpickling the file, but it gave me an error: _pickle.UnpicklingError: unpickling stack underflow – Bade Apr 12 '18 at 19:46

1 Answer

To remove Unicode escape sequences (or better, to translate them) in Python 3:

a.encode('utf-8').decode('unicode_escape')

The decode step translates the Unicode escape sequences into the corresponding Unicode characters. Unfortunately, the `unicode_escape` codec only decodes bytes, not str objects, so you need to encode the string first and then decode it.
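As a minimal sketch of the round trip described above, assuming the escapes are stored as literal backslash sequences in the text: the function names `unescape` and `unescape_keep_non_ascii` are hypothetical, chosen only for illustration. The second variant is an addition not in the answer; it is included because `unicode_escape` decodes bytes as Latin-1, so text that already contains non-ASCII characters gets mangled when it is first encoded as UTF-8.

```python
def unescape(text):
    """Translate literal escapes such as \\u2014 and \\n, as in the answer.

    Suitable for ASCII-only input; non-ASCII characters already present
    in `text` are corrupted, because unicode_escape reads bytes as Latin-1.
    """
    return text.encode('utf-8').decode('unicode_escape')


def unescape_keep_non_ascii(text):
    """Hypothetical variant that tolerates non-ASCII input.

    Characters outside Latin-1 are first rewritten as backslash escapes,
    which the unicode_escape decoder then turns back into characters.
    """
    return text.encode('latin-1', 'backslashreplace').decode('unicode_escape')


# A raw string stands in for a line read from the file: the backslashes
# are ordinary characters here, not yet interpreted.
raw = r'Definition\u2014Human papillomavirus (HPV)\u2013related proliferation.\n'
print(unescape(raw))
```

If the file is known to be pure ASCII, the simpler `unescape` is enough; the variant only matters when real non-ASCII characters are mixed with escaped ones.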

But as pointed out in the comment on the question, you have a serialized document. Try to deserialize it with the correct tool, and the Unicode unescaping will happen automatically.
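The sample in the question looks like the printed form of a Python 2 tuple rather than a binary pickle, which would be consistent with the unpickling error reported in the comments. Under that assumption (which the thread does not confirm), one way to deserialize a line is `ast.literal_eval`, which in Python 3.3+ accepts the `u'...'` prefix and interprets the escapes while parsing. The helper name `parse_record` is hypothetical.

```python
import ast


def parse_record(line):
    """Parse one line holding the repr of a tuple of strings.

    Assumes the line is a valid Python literal; ast.literal_eval raises
    ValueError or SyntaxError otherwise, and never executes code.
    """
    return ast.literal_eval(line.strip())


# A raw string stands in for a line read from the file.
line = r"(u'B9781437714227000962', u'Definition\u2014Human papillomavirus (HPV)\u2013related proliferation.\n')"
doc_id, text = parse_record(line)
print(doc_id)
print(text)
```

With this approach no separate unescaping step is needed: the returned strings already contain the real dash characters and the trailing newline.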

Giacomo Catenazzi