
I have a JSON file full of different kinds of characters. I’m using it for an NLP project. I need to load the text into a dictionary, and then write the keys as they are into another file for some extra pre-processing. The text in question is a mix of numbers, alphabetical characters and code points. The issue is that when I write the dictionary into a text file, it changes the code points into strings if that makes sense. So \u00a1 becomes ¡ and \u00a2 becomes ¢ and so on and so forth. I’d like to write in the code points, not their string representations.

The file in question I am trying to process is here: https://storage.googleapis.com/gpt-2/encoder.json

This is the code I have been using to write the dictionary into a text file.

import json

with open(r"file/path/to/encoder.json", encoding="utf-8") as f:
    encoder = json.load(f)

with open(r"file/path/to/file.txt", "a", encoding="utf-8") as file1:
    for key in encoder:
        file1.write(key + " " + str(encoder[key]) + "\n")

How do I write the code points without changing them?

snakecharmerb
junfanbl
  • You want the code points written as literal strings, i.e. `print(r'\u00a2')`? If so, prefixing the string with the letter `r` tells Python to treat it as a raw string and not interpret any escape sequences – Chris Doyle Apr 01 '20 at 12:04
  • The link is broken, which is why links are discouraged on SO. Post a sample of the actual document in the question itself. – Mark Tolonen Apr 01 '20 at 16:04
  • The best way is to use pickling. With this you can retrieve your original data: no matter what type of data you store in a file, you will receive the same type when loading it – tbhaxor Apr 02 '20 at 07:15
  • @GurkiratSingh Unless you are worried about security, and the data is untrusted. Read the big warning in the [pickle docs](https://docs.python.org/3/library/pickle.html). – Mark Tolonen Apr 02 '20 at 07:18
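The pickle suggestion above round-trips any Python object exactly; a minimal sketch (the security caveat about untrusted data still applies):

```python
import pickle

# A small stand-in dictionary; the real encoder.json maps strings to ints.
data = {"\u00a1": 94, "\u00a2": 95}

# Serialize to bytes and back; the restored object compares equal
# to the original, whatever types it contains.
blob = pickle.dumps(data)
restored = pickle.loads(blob)
print(restored == data)  # True
```

Note that a pickle file is a binary format, not the human-readable text file the question asks for.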

2 Answers


The `json` library writes those Unicode escape codes when the data is dumped with `ensure_ascii=True` (the default). It translates them back into Unicode code points when the file is loaded again.

Example:

>>> s = '\u00a1Hello!' # This is an escape code.  It becomes a single code point in the string.
>>> print(s)
¡Hello!
>>> import json
>>> j = json.dumps(s) # default is ensure_ascii=True
>>> print(j) # Non-ASCII code points are written as escape codes.
"\u00a1Hello!"
>>> s = json.loads(j) # Converts back to code points
>>> print(s)
¡Hello!
>>> s = r'\u00a1Hello!' # A raw string does not process escape codes.
>>> print(s)
\u00a1Hello!
>>> j = json.dumps(s) 
>>> print(j) # JSON escapes the backslash so it is written literally to the file.
"\\u00a1Hello!"
>>> s = json.loads(j)
>>> print(s)
\u00a1Hello!

So to work as you want with JSON, the data needs to be written properly to begin with.
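One way to get the escaped form into the output file is to re-encode each key with `json.dumps`, whose default `ensure_ascii=True` escapes every non-ASCII code point. A minimal sketch, using a small stand-in dictionary in place of the real `encoder.json`:

```python
import json

# Stand-in for the loaded encoder dictionary (assumption: it maps
# strings to integers, as in the GPT-2 encoder file).
encoder = {"\u00a1": 94, "\u00a2": 95, "a": 64}

lines = []
for key, value in encoder.items():
    # json.dumps with its default ensure_ascii=True re-escapes any
    # non-ASCII code point as a \uXXXX sequence; [1:-1] strips the
    # surrounding double quotes it adds.
    escaped = json.dumps(key)[1:-1]
    lines.append(escaped + " " + str(value))

print("\n".join(lines))
```

The same `json.dumps(key)[1:-1]` expression can replace the bare `key` in the write loop from the question.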

Mark Tolonen

I found a file that matches, or at least resembles, the one OP is referring to: encoder.json.

Looking in the file, I can see some of the text referenced in OP's question:

{... "\u00a1": 94, "\u00a2": 95, ...}

And if I run OP's code for converting encoder.json into file.txt, I do see the effect of "changing the code points ("\u00a1") into strings ("¡")".

But, that shouldn't be a problem, because they mean the same thing:

>>> print("¡ 94\n¢ 95")
¡ 94
¢ 95
>>> print("\u00a1 94\n\u00a2 95")
¡ 94
¢ 95
>>> "¡ 94\n¢ 95" == "\u00a1 94\n\u00a2 95"
True

That the characters are encoded as Unicode escape sequences in the original JSON file is just a detail of how Python's JSON encoder works (with its default `ensure_ascii=True`):

>>> json.dumps({"¡": 94, "¢": 95})
'{"\\u00a1": 94, "\\u00a2": 95}'

>>> json.dumps({"¡": 94,"¢": 95}, ensure_ascii=False)
'{"¡": 94, "¢": 95}'
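Either serialization loads back to an equal dictionary, which is why the difference is only cosmetic; a quick sketch:

```python
import json

d = {"\u00a1": 94, "\u00a2": 95}
escaped = json.dumps(d)                      # keys written as \uXXXX escapes
literal = json.dumps(d, ensure_ascii=False)  # keys written as literal characters

# Both forms decode to an equal dictionary.
print(json.loads(escaped) == json.loads(literal) == d)  # True
```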

If you're using Python 2, it's just a little different (and maybe more confusing) because of the u"..." prefix:

>>> print("¡ 94\n¢ 95")
¡ 94
¢ 95
>>> print(u"\u00a1 94\n\u00a2 95")
¡ 94
¢ 95
>>> u"¡ 94\n¢ 95"==u"\u00a1 94\n\u00a2 95"
True

>>> # But this is the same
>>> json.dumps({"¡": 94, "¢": 95})
'{"\\u00a1": 94, "\\u00a2": 95}'

>>> # But this is a little different
>>> json.dumps({"¡": 94,"¢": 95}, ensure_ascii=False)
'{"\xc2\xa1": 94, "\xc2\xa2": 95}'

>>> # But they !! all **mean** the same thing !!
>>> \
... json.loads('{"\xc2\xa1": 94, "\xc2\xa2": 95}') == \
... json.loads('{"\\u00a1": 94, "\\u00a2": 95}') == \
... json.loads('{"¡": 94, "¢": 95}')
True

Based on what I've read in this gpt-2 issue:

The encoder code doesn't like spaces, so they replace spaces and other whitespace characters with other unicode bytes. See encoder.py for details.

having a text file that looks like the following would probably mess up your vocabulary:

...
\u00a1 94
\u00a2 95
...

Have you had an actual problem using file.txt in your NLP processing chain?

Zach Young