
I have a JSON file which contains text like this

 .....wax, and voila!\u00c2\u00a0At the moment you can't use our ...

My question is simple: how do I CONVERT (not remove) these \u escape codes into the characters they represent (spaces, apostrophes, etc.)?

Input: a text file with .....wax, and voila!\u00c2\u00a0At the moment you can't use our ...

Output: .....wax, and voila!(converted to the line break)At the moment you can't use our ...

Python code

import requests

def TEST():
    export = requests.get('https://sample.uk/', auth=('user', 'pass')).text

    # .text is already a decoded str in Python 3, so no .decode() call is needed
    with open("TEST.json", 'w', encoding='utf8') as file:
        file.write(export)

What I have tried:

  • Using .json()
  • various combinations of .encode() and .decode(), etc.

Edit 1

When I upload this file to BigQuery, a stray Â symbol appears.

Bigger Sample:

{
    "xxxx1": "...You don\u2019t nee...",
    "xxxx2": "...Gu\u00e9rer...",
    "xxxx3": "...boost.\u00a0Sit back an....",
    "xxxx4": "\" \u306f\u3058\u3081\u307e\u3057\u3066\"",
    "xxxx5": "\u00a0\n\u00a0",
    "xxxx6": "It was Christmas Eve babe\u2026",
    "xxxx7": "It\u2019s xxx xxx\u2026"
}

Python code:

import json
import re
import codecs


def load():
    epos_export = r'{"xxxx1": "...You don\u2019t nee...","xxxx2": "...Gu\u00e9rer...","xxxx3": "...boost.\u00a0Sit back an....","xxxx4": "\" \u306f\u3058\u3081\u307e\u3057\u3066\"","xxxx5": "\u00a0\n\u00a0","xxxx6": "It was Christmas Eve babe\u2026","xxxx7": "It\u2019s xxx xxx\u2026"}'
    x = json.loads(re.sub(r"(?i)(?:\\u00[0-9a-f]{2})+", unmangle_utf8, epos_export))

    with open("TEST.json", "w", encoding="utf8") as file:
        json.dump(x, file)

def unmangle_utf8(match):
    escaped = match.group(0)                   # '\\u00e2\\u0082\\u00ac'
    hexstr = escaped.replace(r'\u00', '')      # 'e282ac'
    buffer = codecs.decode(hexstr, "hex")      # b'\xe2\x82\xac'

    try:
        return buffer.decode('utf8')           # '€'
    except UnicodeDecodeError:
        print("Could not decode buffer: %s" % buffer)
        return escaped                         # fall back to the original text



if __name__ == '__main__':
    load()
Oksana Ok
  • i know you've said that you have to use python 2, but can i just ask why in general? If it's some kind of requirement because of existing code, i'd highly recommend that you push for a change to python 3 if at all possible. – Paritosh Singh Jul 09 '19 at 14:55
  • Exact inputs and outputs, please. See https://stackoverflow.com/help/minimal-reproducible-example. – chepner Jul 09 '19 at 14:56
  • @ParitoshSingh .... you are right, Python 3 will be better ( just read it ) – Oksana Ok Jul 09 '19 at 15:01
  • Is it really utf-8 input? For me looks rather UTF-16. – 0andriy Jul 09 '19 at 15:03
  • When you look at this section of the JSON string, are there *actual* backslashes and digits in this location? Or, tested the other way around, when you do `print(export.replace(r'\u00c2\u00a0', ''))`, are they gone? – Tomalak Jul 09 '19 at 15:21
  • @Tomalak they part of the text, and yes your command will work. – Oksana Ok Jul 10 '19 at 07:13
  • @0andriy the output is from URL, by using get method which contains this \u codes (not sure what exactly used).. it was just examples which I have used – Oksana Ok Jul 10 '19 at 07:15
  • It looks like the source accidentally passed UTF-8 strings to its JSON encoder. You will need to first JSON-decode the string to a data structure, then UTF-8-decode each string separately. – Botje Jul 10 '19 at 07:23
  • @Botje As per the OP, there are literal backslashes and digits in the string. Those are not escape sequences, there is nothing to decode. – Tomalak Jul 10 '19 at 08:01
  • `\u00c2\u00a0` is the JSON representation of the bytes `c2 a0`, which is the UTF-8 encoding of the unicode character U+00A0. Had the source done their work correctly, the JSON string would either contain `\u00a0` or the bytes `c2 a0`. – Botje Jul 10 '19 at 08:06
  • @Botje JSON string contains - \u00a0 which is part of the JSON string ( I know this is bad, but no choice to change the source) – Oksana Ok Jul 10 '19 at 08:08
  • @Botje Anything I can do in current circumstances? – Oksana Ok Jul 10 '19 at 08:17
  • @OksanaOk Either manually interpret the unicode escapes such that JSON decoding will see UTF-8 and do the right thing, or decode strings *after* JSON decoding. If possible, nag at "sample.uk" and tell them they're producing garbage. – Botje Jul 10 '19 at 08:29
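Botje's diagnosis can be checked in a few lines (a minimal sanity check, not part of the original question): if `\u00c2\u00a0` really is a pair of individually escaped UTF-8 bytes, then re-encoding the decoded characters as Latin-1 and decoding the result as UTF-8 should yield a single U+00A0.

```python
import json

# json.loads turns '\u00c2\u00a0' into the two characters U+00C2 and U+00A0...
s = json.loads('"\\u00c2\\u00a0"')

# ...which, re-read as Latin-1 bytes (c2 a0), are the UTF-8 encoding of U+00A0.
fixed = s.encode('latin1').decode('utf8')
print(repr(fixed))   # '\xa0' -- a single non-breaking space
```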

3 Answers


I have made this crude UTF-8 unmangler, which appears to solve your messed-up encoding situation:

import codecs
import re
import json

def unmangle_utf8(match):
    escaped = match.group(0)                   # '\\u00e2\\u0082\\u00ac'
    hexstr = escaped.replace(r'\u00', '')      # 'e282ac'
    buffer = codecs.decode(hexstr, "hex")      # b'\xe2\x82\xac'

    try:
        return buffer.decode('utf8')           # '€'
    except UnicodeDecodeError:
        print("Could not decode buffer: %s" % buffer)
        return escaped                         # fall back to the original text

Usage:

broken_json = '{"some_key": "... \\u00e2\\u0080\\u0099 w\\u0061x, and voila!\\u00c2\\u00a0\\u00c2\\u00a0At the moment you can\'t use our \\u00e2\\u0082\\u00ac ..."}'
print("Broken JSON\n", broken_json)

converted = re.sub(r"(?i)(?:\\u00[0-9a-f]{2})+", unmangle_utf8, broken_json)
print("Fixed JSON\n", converted)

data = json.loads(converted)
print("Parsed data\n", data)
print("Single value\n", data['some_key'])

It uses a regex to pick up the hex sequences from your string, converts them to individual bytes, and decodes them as UTF-8.

For the sample string above (I've included the 3-byte character as a test) this prints:

Broken JSON
 {"some_key": "... \u00e2\u0080\u0099 w\u0061x, and voila!\u00c2\u00a0\u00c2\u00a0At the moment you can't use our \u00e2\u0082\u00ac ..."}
Fixed JSON
 {"some_key": "... ’ wax, and voila!  At the moment you can't use our € ..."}
Parsed data
 {'some_key': "... ’ wax, and voila!\xa0\xa0At the moment you can't use our € ..."}
Single value
 ... ’ wax, and voila!  At the moment you can't use our € ...

The \xa0 in the "Parsed data" output is just how Python's dict repr escapes the character on the console; it still is the actual non-breaking space.
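That repr behaviour is easy to confirm (a small illustration, not from the original answer):

```python
s = "voila!\u00a0At the moment"

# The dict repr escapes the NBSP as \xa0...
print({'some_key': s})

# ...but printing the string itself shows the real character.
print(s)
```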

Tomalak
  • `Could not decode buffer: b'\x99\xc2'` `Could not decode buffer: b'\xa0\xc2\xa0'` `Could not decode buffer: b'\xe7\xe2\x80'` And e.t.c. – Oksana Ok Jul 10 '19 at 09:24
  • UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 408962: character maps to – Oksana Ok Jul 10 '19 at 09:24
  • Well, that would mean that your input is more broken (or different) than assumed. You need to show a larger sample. – Tomalak Jul 10 '19 at 09:45
  • (...and I've greatly simplified the unmangle function, I was thinking way too complicated at first) – Tomalak Jul 09 '19 at 11:08
  • `Could not decode buffer: b'\xe7'`, `Could not decode buffer: b'\xed'` – the \u codes look tidier, but I have no idea why they are still not converted – Oksana Ok Jul 11 '19 at 07:49
  • As I said, you must provide a larger sample. Quoting the error messages without any context does not help me at all. – Tomalak Jul 11 '19 at 07:52
  • Sorry, my apologies. I have updated the question. Thank you in advance – Oksana Ok Jul 11 '19 at 08:07
  • I'm sorry, but do you have any suggestions based on new sample data? – Oksana Ok Jul 11 '19 at 10:28
  • Please provide a sample that is valid Python code and produces the errors you see, to rule out any ambiguity. (The flurry of comments directly under your question was mostly because your initial example was ambiguous, too.) – Tomalak Jul 11 '19 at 10:39
  • Please have a look for python code. Maybe I have done some stupid mistake – Oksana Ok Jul 11 '19 at 10:54
  • Okay, that's better, but the TEST.json file is still missing. – Tomalak Jul 11 '19 at 10:56
  • TEST.json is an empty file which populated from request feed. For testing, sample text can be used - "Bigger Sample:" – Oksana Ok Jul 11 '19 at 11:03
  • No, because that's not valid JSON. – Tomalak Jul 11 '19 at 11:03
  • Have amended to make a valid JSON – Oksana Ok Jul 11 '19 at 11:12
  • Okay, and now we are back at where we started. *This* JSON is not broken at all. It parses just fine as it is, without using any "unmangle" function - try it. And on top of that, it does not contain `\u00c2\u00a0`, but that is what you started your question about. So... I'm still confused. – Tomalak Jul 11 '19 at 11:20
  • I can add \u00c2\u00a0, there are a lot of them, but even when I try my own example it is not converted. If you are telling me it does work on your side, I'm completely lost.... – Oksana Ok Jul 11 '19 at 11:44
  • Again. Write a self-contained code sample that produces the error you see. Don't include any code from any of the answers here. I want to see the original error this question was about. Make something that I can copy and paste 1:1 and get the same error when I run it. (You should have done that from the start, it's called [mcve].) What we are doing right now in the comments is completely inefficient and leads nowhere. – Tomalak Jul 11 '19 at 11:53
  • I have copy-pasted the code and there is no ERROR; the issue is that when you open TEST.json, there are still \u codes – Oksana Ok Jul 11 '19 at 12:07
  • Of course you do. That's how JSON works. Every Unicode character can be encoded as a \u code, that's normal. – Tomalak Jul 11 '19 at 13:59
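As a side note to Tomalak's last comment: whether `json.dump` writes `\u` escapes or real characters is controlled by its `ensure_ascii` flag; both forms are equally valid JSON.

```python
import json

data = {"k": "voila!\u00a0done"}

# The default escapes all non-ASCII characters:
print(json.dumps(data))                      # {"k": "voila!\u00a0done"}

# With ensure_ascii=False, the real character is written instead:
print(json.dumps(data, ensure_ascii=False))  # {"k": "voila! done"} (real NBSP)
```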

The hacky approach is to remove the outer layer of encoding:

import re

# Assume export is a bytes-like object (e.g. response.content)
export = re.sub(rb'\\u00([89a-f][0-9a-f])',
                lambda m: bytes.fromhex(m.group(1).decode()),
                export, flags=re.IGNORECASE)

This matches the escaped UTF-8 bytes and replaces them with the actual UTF-8 bytes. Writing the resulting bytes-like object to disk (without further decoding!) should result in a valid UTF-8 JSON file.

Of course, this will break if the file contains genuinely escaped Unicode characters in that byte range, like \u00e9 for an accented "e".
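Put together on a small sample (the response bytes here are made up for illustration), the substitution and a normal JSON parse look like this:

```python
import json
import re

export = b'{"some_key": "voila!\\u00c2\\u00a0done"}'   # hypothetical response bytes

# Replace each escaped high byte (\u0080..\u00ff) with the raw byte itself.
fixed = re.sub(rb'\\u00([89a-f][0-9a-f])',
               lambda m: bytes.fromhex(m.group(1).decode()),
               export, flags=re.IGNORECASE)

# fixed now holds the raw bytes c2 a0 instead of the two escapes,
# so it parses as regular UTF-8 JSON:
print(json.loads(fixed))   # {'some_key': 'voila!\xa0done'}
```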

Botje

Since you try to write this to a file named TEST.json, I will assume that this string is part of a larger JSON string.

Let me use a full example:

js = '''{"a": "and voila!\\u00c2\\u00a0At the moment you can't use our"}'''
print(js)

{"a": "and voila!\u00c2\u00a0At the moment you can't use our"}

I would first load that with json:

x = json.loads(js)
print(x)

{'a': "and voila!Â\xa0At the moment you can't use our"}

Ok, this now looks like a UTF-8 string that was wrongly decoded as Latin-1. Let us do the reverse operation:

x['a'] = x['a'].encode('latin1').decode('utf8')
print(x)
print(x['a'])

{'a': "and voila!\xa0At the moment you can't use our"}
and voila! At the moment you can't use our

Ok, it is now fine and we can convert it back to a correct json string:

print(json.dumps(x))

{"a": "and voila!\u00a0At the moment you can't use our"}

meaning a correctly encoded NO-BREAK SPACE (U+00A0)

TL/DR: what you should do is:

# load the string as json:
js = json.loads(request)

# identify the string values in the json - you probably know how but I don't...
...

# convert the strings:
js[...] = js[...].encode('latin1').decode('utf8')

# convert back to a json string
request = json.dumps(js)
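The "identify the string values" step above can be sketched as a recursive walk over the parsed structure (`fix_strings` is a hypothetical helper name, not part of the answer itself):

```python
import json

def fix_strings(obj):
    """Recursively re-decode every string value in a parsed JSON structure."""
    if isinstance(obj, str):
        try:
            return obj.encode('latin1').decode('utf8')
        except (UnicodeEncodeError, UnicodeDecodeError):
            return obj   # already clean, or not Latin-1 mojibake -- leave as is
    if isinstance(obj, dict):
        return {k: fix_strings(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return [fix_strings(v) for v in obj]
    return obj

data = json.loads('{"a": "and voila!\\u00c2\\u00a0At the moment"}')
print(fix_strings(data))   # {'a': 'and voila!\xa0At the moment'}
```

The try/except matters: strings containing characters above U+00FF (like the Japanese sample in the question) cannot be encoded as Latin-1, which is exactly the signal that they were never mangled in the first place.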
Serge Ballesta