Most diffable data interchange format?

Question

JSON is great because it has wide support, and it's easy for both machines and humans to read and write.

YAML is great because it's even easier for humans to read and write, and it has support for more data types.

TOML is like an improved version of INI.

I want to optimize for something different: diffability. That is, how easy is it to understand what changed between two versions of the same document when they are run through a standard diff tool?

As far as I can tell, Yarn went so far as to create their own custom format for their lock files just to improve this aspect.

Are there any open-source JS libraries for producing diffable output from an object?

Are you asking for a data format, a diff library or both? I also think that the answer depends entirely on the nature and purpose of the information to be diffed. — Álvaro González, Mar 02 '17 at 17:30
@ÁlvaroGonzález data/file format. The nature of the data is unknown. I want to diff anything that might be returned from an HTTP REST API. I'll convert it from JSON, or whatever it's in, into this new data interchange format, and then diff it. The idea is to be able to see how it changes over time. — mpen, Mar 02 '17 at 20:01
I've seen people tracking changes to German laws using Git and markdown was the format of choice because they are basically very large blocks of plain text. But tracking e.g. the evolution of a product catalogue is a very different use case that would probably require a different format. That's what I meant. — Álvaro González, Mar 03 '17 at 08:15
@ÁlvaroGonzález Fair enough, but I am talking about a "data interchange format" not a human language. Assume there is some structure to my data :-) — mpen, Mar 03 '17 at 16:13
I've always assumed that. But most diff algorithms are line-based, and a JSON file with program settings is not the same thing to diff as a JSON file with EULAs. — Álvaro González, Mar 04 '17 at 09:35
Well, I'm not looking for the perfect be-all-end-all solution here. There will always be some degenerate cases where it doesn't perform so well. However, for the purposes of the question, one can assume there won't be large paragraphs in the dataset. — mpen, Mar 04 '17 at 18:33

Gabriel · Answer 1 · 2021-10-24T13:36:45.413

Canonicalized then prettified JSON

Canonicalization normalizes the type serialization and sorts the fields.

Prettifying adds back white space and line separators.

There is no standard for the prettify step yet, so one would need to be agreed on.

I would like to see a YAML equivalent of this diffability. One possible route is to convert YAML to JSONC, canonicalize the JSONC, and convert the result back to YAML. It is not that simple in practice, though: the JSONC to YAML conversion would also need to be standardized, and a JSONC canonicalizer might not exist yet.

Note: Prettifying makes it no longer canonical, but is necessary for diffability.
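As a rough illustration of the two steps, here is a minimal sketch in plain JavaScript. It is not a full JCS implementation: `sortKeys` and `toDiffableJson` are hypothetical helper names, the input is assumed to be plain JSON data, and keys that look like integers (such as "10") are still ordered numerically by the JavaScript engine rather than as strings.

```javascript
// Recursively rebuild a parsed JSON value with object keys in sorted order.
// Assumption: the input is plain JSON data (no Dates, Maps, cycles, etc.).
function sortKeys(value) {
  if (Array.isArray(value)) {
    return value.map(sortKeys); // array order is significant, so keep it
  }
  if (value !== null && typeof value === "object") {
    const sorted = {};
    for (const key of Object.keys(value).sort()) {
      sorted[key] = sortKeys(value[key]);
    }
    return sorted;
  }
  return value; // numbers, strings, booleans, null
}

// Hypothetical helper: sort the keys, then pretty-print with 2-space indentation
// so that every scalar ends up on its own line.
function toDiffableJson(value) {
  return JSON.stringify(sortKeys(value), null, 2);
}

// Same data, different key order and number spelling.
const a = toDiffableJson({ b: 1, a: { d: [3, 1], c: 2.50 } });
const b = toDiffableJson({ a: { c: 2.5, d: [3, 1] }, b: 1 });
console.log(a === b); // true
```

Because both inputs serialize to the same text, a line-based diff between them is empty, and a real change to one value touches only that value's line.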

RFC 8785 (JSON Canonicalization Scheme, JCS) offers a sample ES6 JSON canonicalizer.

The following Open Source implementations have been verified to be compatible with JCS:

JavaScript: https://www.npmjs.com/package/canonicalize

Java: https://github.com/erdtman/java-json-canonicalization

Go: https://github.com/cyberphone/json-canonicalization/tree/master/go

.NET/C#: https://github.com/cyberphone/json-canonicalization/tree/master/dotnet

Python: https://github.com/cyberphone/json-canonicalization/tree/master/python3

— Open Source Implementations

Canonicalize

Raw

  {
    "numbers": [333333333.33333329, 1E30, 4.50,
                2e-3, 0.000000000000000000000000001],
    "string": "\u20ac$\u000F\u000aA'\u0042\u0022\u005c\\\"\/",
    "literals": [null, true, false]
  }

Remove whitespace and normalize serialization

{"numbers":[333333333.3333333,1e+30,4.5,0.002,1e-27],"string":"€$\u000f\nA'B\"\\\\\"/","literals":[null,true,false]}

Sort

{"literals":[null,true,false],"numbers":[333333333.3333333,1e+30,4.5,0.002,1e-27],"string":"€$\u000f\nA'B\"\\\\\"/"}

Prettify

{
  "literals": [
    null,
    true,
    false
  ],
  "numbers": [
    333333333.3333333,
    1e+30,
    4.5,
    0.002,
    1e-27
  ],
  "string": "\u20ac$\u000f\nA'B\"\\\\\"/"
}
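The walkthrough above can be approximated with nothing but the built-in `JSON` object, since JCS bases its number and string serialization on ECMAScript's. The sketch below is an assumption-laden shortcut rather than a verified JCS implementation: `sortingReplacer` is a hypothetical name, and for this particular input the output should match the Prettify result.

```javascript
// The raw document from the example; String.raw keeps the backslashes intact.
const raw = String.raw`{
  "numbers": [333333333.33333329, 1E30, 4.50,
              2e-3, 0.000000000000000000000000001],
  "string": "\u20ac$\u000F\u000aA'\u0042\u0022\u005c\\\"\/",
  "literals": [null, true, false]
}`;

// Replacer that swaps every plain object for a copy with sorted keys.
// JSON.stringify then visits the copy, so nested objects are sorted too.
const sortingReplacer = (_key, value) =>
  value !== null && typeof value === "object" && !Array.isArray(value)
    ? Object.fromEntries(Object.keys(value).sort().map((k) => [k, value[k]]))
    : value;

// Parsing normalizes numbers and escapes; stringifying with an indent prettifies.
const pretty = JSON.stringify(JSON.parse(raw), sortingReplacer, 2);
console.log(pretty);
```

Saving each API response this way and comparing the files with a standard diff tool keeps changes confined to the lines whose values actually changed.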

Canonicalizing YAML would be a nightmare, what with its [63 different ways to write a string](https://stackoverflow.com/a/21699210/157957)... — IMSoP, Oct 22 '21 at 17:45
