
I have a folder of JSON files (approx. 70 GB of data); these JSON files are emails. I want to open all of the files and find the Unicode characters in them using Python. Later I want to replace those characters using a regular expression. Could you please outline an approach that I can follow?

I am doing this to get rid of the error:

ValueError: Unpaired high surrogate when decoding 'string' on reading json file

I understand that this question might sound a bit vague, but please feel free to ask about anything that is unclear.

Any help would be much appreciated :)
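
One possible layout, as a minimal sketch rather than a confirmed fix: scan the raw JSON text for `\uD800`-`\uDFFF` escapes that are not part of a valid high+low pair and replace them with `\ufffd` before handing the files to any reader. The names `fix_lone_surrogates` and `clean_folder` are hypothetical, and the sketch assumes the files are UTF-8, that each one fits in memory, and that the surrogates appear as `\uXXXX` escapes in the JSON text.

```python
import json
import re
from pathlib import Path

# Alternatives are tried in order: an escaped backslash (left alone),
# a valid high+low surrogate pair (left alone), or a single surrogate
# escape, which at that point must be unpaired.
_SURROGATE_RE = re.compile(
    r"\\\\"
    r"|\\u[dD][89abAB][0-9a-fA-F]{2}\\u[dD][c-fC-F][0-9a-fA-F]{2}"
    r"|\\u[dD][89a-fA-F][0-9a-fA-F]{2}"
)


def _replace_lone(match):
    text = match.group(0)
    # A single \uXXXX escape is 6 characters; pairs are 12, "\\" is 2.
    return "\\ufffd" if len(text) == 6 else text


def fix_lone_surrogates(raw_json):
    """Replace unpaired surrogate escapes in raw JSON text with \\ufffd."""
    return _SURROGATE_RE.sub(_replace_lone, raw_json)


def clean_folder(src_dir, dst_dir):
    """Hypothetical driver: write cleaned copies and list the changed files."""
    out = Path(dst_dir)
    out.mkdir(parents=True, exist_ok=True)
    changed = []
    for path in sorted(Path(src_dir).glob("*.json")):
        raw = path.read_text(encoding="utf-8")
        fixed = fix_lone_surrogates(raw)
        if fixed != raw:
            changed.append(path.name)
        (out / path.name).write_text(fixed, encoding="utf-8")
    return changed


# Quick check on an in-memory string instead of a file.
print(json.loads(fix_lone_surrogates(r'{"body": "5000\ud83d yen"}')))
```

Working on the raw text keeps valid pairs (such as emoji) and ordinary escapes like `\u2013` untouched, and the list returned by `clean_folder` shows which files actually contained the problem.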

petezurich
Samiksha
  • Can you provide a concise example of the problem and explain what exactly you are trying to do? – AKA Mar 31 '21 at 06:05
  • Possibly relevant https://stackoverflow.com/questions/38147259/how-to-work-with-surrogate-pairs-in-python – snakecharmerb Mar 31 '21 at 06:14
  • Similar error https://stackoverflow.com/questions/66011007/valueerror-unpaired-high-surrogate-when-decoding-string-on-reading-json-file – snakecharmerb Mar 31 '21 at 06:16
  • @Wave thank you for looking into it. I have a dataset of emails stored as JSON files. The main aim is to perform TF-IDF (to get the highest-ranking words) on the dataset using PySpark. For that I first need to read the JSON files into a pandas DataFrame and write them out in record-oriented format, because that is easy to handle in PySpark. While reading some of them I get `ValueError: Unpaired high surrogate when decoding 'string' on reading json file`. My guess is that the email body contains Unicode that is causing the problem. – Samiksha Mar 31 '21 at 08:06
  • @snakecharmerb thank you for looking into it. Actually, the second link that you shared was posted by me :) – Samiksha Mar 31 '21 at 08:09
  • I think you can read the JSON files manually using `open("file.json", "r")`, which gives you a string; then encode it while replacing the surrogates and decode it again, something like `"string with surrogate chars".encode("utf-8", errors="replace").decode("utf-8")` – AKA Mar 31 '21 at 10:35
  • I created a JSON with valid surrogate pairs and unpaired surrogates and it reads fine (Python 3.8.8). *Printing* a string with an unpaired surrogate, however, gives a `UnicodeEncodeError`. Please edit your question to show a [mcve] that produces the error you see. – Mark Tolonen Mar 31 '21 at 23:30
  • @Wave, I tried the solution you provided but it still throws the same error – Samiksha Apr 02 '21 at 19:53
  • @MarkTolonen thank you for looking into it. I used `try` and `except` to check which files have this error, looked into those files, and found multiple values containing \u and \\u (e.g. \u2013, \u2019, \u00a010, \u00a9, \u00bb, \\u001b). An exact example: `5000\u001b$B1_0J>e$N\u001b(B\n\u001b$BA4$F$N9XF~#P#T$,#3!%#5G\\\u001b`. Any suggestions on this? – Samiksha Apr 02 '21 at 19:57
  • You'll have to edit the question and show a small sample of JSON and the code that reads it and gets an error. Make a [mcve]. – Mark Tolonen Apr 02 '21 at 20:03
  • FYI the sample in your comment, as a raw string and with double quotes around the string, loads fine. So please make an unambiguous [mcve]. – Mark Tolonen Apr 02 '21 at 20:15
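
The encode/decode suggestion from the comments can be sketched as follows. This is only an illustration under assumptions: it parses with the standard `json` module (which, per the comments, loads unpaired surrogates without raising) and then cleans every string in the parsed result. The helper name `replace_surrogates` and the sample document are made up for the example, and whether this helps with the reader that raised the original `ValueError` is not confirmed.

```python
import json


def replace_surrogates(value):
    """Recursively replace unpaired surrogates in parsed JSON data with '?'."""
    if isinstance(value, str):
        # Lone surrogates cannot be encoded as UTF-8, so errors="replace"
        # turns each of them into "?"; everything else round-trips unchanged.
        return value.encode("utf-8", errors="replace").decode("utf-8")
    if isinstance(value, list):
        return [replace_surrogates(item) for item in value]
    if isinstance(value, dict):
        return {
            replace_surrogates(key): replace_surrogates(item)
            for key, item in value.items()
        }
    return value


# Made-up sample: one unpaired high surrogate and one valid pair.
doc = json.loads(r'{"body": "price \ud83d ok", "tags": ["\ud83d\ude00"]}')
clean = replace_surrogates(doc)
print(json.dumps(clean))
```

Writing `clean` back out with `json.dump` would then give a file with no unpaired surrogates left in it, while valid pairs are preserved.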

0 Answers