
I have a very large JSON file. Most of it is valid JSON data, but parts of it are not. The following is a simplification of my case:

[
    "this is valid: \ud835\udc47",
    "this is invalid: \ud835"
]

The first item is valid and parses successfully, but deserialization fails on the second item. \ud835 is a high (leading) surrogate: UTF-8 cannot encode a surrogate code point at all, and UTF-16 does not allow it to stand alone, because it must be followed by a low (trailing) surrogate escape in the range \udc00 to \udfff.
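As a small worked example of why the first item is valid, here is a sketch of how a UTF-16 surrogate pair combines into one code point. The function name `combine_surrogates` is made up for illustration; the arithmetic is the standard UTF-16 decoding rule.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 surrogate pair into a single Unicode code point."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError(f"not a high surrogate: {high:#06x}")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError(f"not a low surrogate: {low:#06x}")
    # Each surrogate carries 10 bits; the pair addresses code points
    # starting at U+10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# The pair from the first array item: \ud835\udc47
code_point = combine_surrogates(0xD835, 0xDC47)
print(hex(code_point))  # 0x1d447
```

A lone \ud835 supplies only the high half of this calculation, so there is no code point it can decode to.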

This issue arose because an HTTP server used Python's built-in JSON deserializer and saved the data to a database. Python's deserializer accepted a lone "\ud835" escape, which is not valid in UTF-8 or UTF-16. Now that we want to migrate this application and database to Rust with serde, serde rejects this invalid string.
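The Python behaviour described above can be reproduced with a short sketch using only the standard `json` module. The helper `is_utf8_encodable` is a hypothetical name introduced for this example.

```python
import json


def is_utf8_encodable(text: str) -> bool:
    """Return True if the string can be encoded as UTF-8."""
    try:
        text.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


# Raw string literals keep the backslash escapes intact for the JSON parser.
valid = json.loads(r'"\ud835\udc47"')  # pair is merged into one character
lone = json.loads(r'"\ud835"')         # accepted, but stays a lone surrogate

print(hex(ord(valid)), is_utf8_encodable(valid))  # 0x1d447 True
print(hex(ord(lone)), is_utf8_encodable(lone))    # 0xd835 False
```

Python's `str` can hold a lone surrogate, which is why the value survived parsing; the problem only surfaces when the string has to be encoded as UTF-8 or read by a stricter parser.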

Shepmaster
Johan Bjäreholt
  • 1
    Maybe you can deserialize each element into a binary buffer (or [rmp_serde::Raw](https://docs.rs/rmp-serde/0.13.7/rmp_serde/struct.Raw.html)) and transform each with [`String::from_utf8_lossy`](https://doc.rust-lang.org/std/string/struct.String.html#method.from_utf8_lossy) – Alexey S. Larionov Sep 29 '20 at 07:07
  • 2
    You can implement a custom deserialization scheme, which would allow performing whatever error recovery you see fit. Alternatively, you can preprocess the JSON data before passing it to serde. – Masklinn Sep 29 '20 at 07:08
  • 1
    I think your easiest bet is to write some ad-hoc Python code to sanitize the data. Python's json module should be able to deserialize the data, since it wrote it. You can then find all incomplete surrogate pairs in all strings, fix them in whatever way you want, and write valid JSON back. – Sven Marnach Sep 29 '20 at 07:11
  • The solution was a little bit different for arrays, but otherwise the duplication mark is correct. Here's my solution to the issue https://github.com/serde-rs/serde/issues/1583#issuecomment-706658175 – Johan Bjäreholt Oct 26 '20 at 18:04
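The preprocessing route suggested in the comments could look roughly like the sketch below. It assumes the whole document fits in memory and that replacing each lone surrogate with U+FFFD is acceptable; `sanitize` and `LONE_SURROGATE` are hypothetical names, and this is not the solution from the linked GitHub comment.

```python
import json
import re

# After json.loads, a well-formed pair has already been merged into a single
# character, so any code point left in U+D800..U+DFFF is a lone surrogate.
LONE_SURROGATE = re.compile(r"[\ud800-\udfff]")


def sanitize(value, replacement="\ufffd"):
    """Recursively replace lone surrogates in every string of a JSON value."""
    if isinstance(value, str):
        return LONE_SURROGATE.sub(replacement, value)
    if isinstance(value, list):
        return [sanitize(item, replacement) for item in value]
    if isinstance(value, dict):
        return {
            sanitize(key, replacement): sanitize(item, replacement)
            for key, item in value.items()
        }
    return value  # numbers, booleans and None pass through unchanged


raw = r'["this is valid: \ud835\udc47", "this is invalid: \ud835"]'
cleaned = sanitize(json.loads(raw))

# The re-serialized document no longer contains a lone surrogate escape.
output = json.dumps(cleaned)
print(output)
```

For a very large file, loading everything with `json.loads` may be too expensive, in which case the same replacement would have to be applied in a streaming fashion or per database row.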

0 Answers