how to find and replace unicode in json files?

Question

I have folder of json files (approx 70 GB data), these json files are emails. I want to open all the files and find Unicodes using python. later I want to replace those Unicodes with any regular expression. Could you please provide a layout that I can follow through?

I am doing this to get rid of the error:

ValueError: Unpaired high surrogate when decoding 'string' on reading json file

I understand that this question might sound bit vague but please free to ask any doubts regarding the question.

Any help would be much appreciated :)

Can you provide a concise example of the problem and what are you trying to do exactly? — AKA, Mar 31 '21 at 06:05
Possibly relevant https://stackoverflow.com/questions/38147259/how-to-work-with-surrogate-pairs-in-python — snakecharmerb, Mar 31 '21 at 06:14
Similar error https://stackoverflow.com/questions/66011007/valueerror-unpaired-high-surrogate-when-decoding-string-on-reading-json-file — snakecharmerb, Mar 31 '21 at 06:16
@Wave thank you for looking into it.. I have a dataset of emails stored as json files. the main aim is to perform tf-idf (to get high ranking words) on the dataset using pyspark. for that I need to first read and write json files in pandas df as record oriented format because that can be easily dealt in pyspark. while reading some of them I get `ValueError: Unpaired high surrogate when decoding 'string' on reading json file`. my idea is that the email body has unicode that is causing problem. — Samiksha, Mar 31 '21 at 08:06
@snakecharmerb thank you for looking into it. actually the second link that you shared has been posted by me only :) — Samiksha, Mar 31 '21 at 08:09
I think you can manually read the json files using `open("file.json", "r")` which gives you string then encode it by replacing surrogates and then decode...something like `"string with surrogate chars".encode("utf-8", errors="replace").decode("utf-8")` — AKA, Mar 31 '21 at 10:35
I created a JSON with valid surrogate pairs and unpaired surrogates and it reads fine (Python 3.8.8). *Printing* a string with an unpaired surrogate, however, gives a `UnicodeEncodeError`. Please edit your question to show a [mcve] that produces the error you see. — Mark Tolonen, Mar 31 '21 at 23:30
@Wave, I tried the solution you provides but it still throws the same error — Samiksha, Apr 02 '21 at 19:53
@MarkTolonen thank you for looking into it. I used `try` and `except`to check which files have this error and also looked into the files and found multiple value with \u and \\u (ex. \u2013 , \u2019 , \u00a010 , \u00a9 , \u00bb , \\u001b) . exact example be: `5000\u001b$B1_0J>e$N\u001b(B\n\u001b$BA4$F$N9XF~#P#T$,#3!%#5G\\\u001b` any suggestions on this? — Samiksha, Apr 02 '21 at 19:57
You'll have to edit the question and show a small sample of JSON and the code that reads it and gets an error. Make a [mcve]. — Mark Tolonen, Apr 02 '21 at 20:03
FYI the sample in your comment, as a raw string and with double quotes around the string, loads fine. So please make an unambiguous [mcve]. — Mark Tolonen, Apr 02 '21 at 20:15

how to find and replace unicode in json files?

0 Answers0