
I understand canonicalization and normalization to mean removing any non-meaningful or ambiguous parts of data's presentation, turning effectively identical data into actually identical data.

For example, if you want to get the hash of some input data and it's important that anyone else hashing the canonically same data gets the same hash, you don't want one file indenting with tabs and the other using spaces (and no other difference) to cause two very different hashes.

In the case of JSON:

  • object properties would be placed in a standard order (perhaps alphabetically)
  • unnecessary white spaces would be stripped
  • indenting either standardized or stripped
  • the data may even be re-modeled in an entirely new syntax, to enforce the above
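
As a minimal sketch of the bullet points above (not a full canonicalization spec such as RFC 8785 / JCS), Python's standard `json` module can already sort keys and strip insignificant whitespace:

```python
import json

# Minimal JSON canonicalization sketch: object properties sorted
# alphabetically, all insignificant whitespace stripped. Real schemes
# (e.g. RFC 8785) also pin down number and string formatting.
def canonicalize(value) -> str:
    return json.dumps(value, sort_keys=True, separators=(",", ":"))

a = canonicalize({"b": 1, "a": [1, 2]})
b = canonicalize({"a": [1, 2], "b": 1})
assert a == b == '{"a":[1,2],"b":1}'
```

With this, hashing `a` and `b` would give the same digest, which is exactly the property the hashing example asks for.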

Is my definition correct, and the terms are interchangeable? Or is there a well-defined and specific difference between canonicalization and normalization of input data?

Jacob Ford
  • Surprisingly, a duplicate of this is not easy to find on [so]. But precise definitions for the general mathematical meaning should be easy to find, & you will find a specific meaning in a specific context. – philipxy Mar 21 '19 at 22:42
  • Yes, that is why I asked it. And I did not find the mathematical meanings easy to find, nor do mathematical definitions _always_ correspond to software development definitions. – Jacob Ford Mar 21 '19 at 22:44
  • It is always a good idea to go to published academic resources. (And for languages & products, manuals.) There are almost always textbooks & slides available free online in PDF. Eg whatever is being addressed at the source for your Java example is probably also addressed in such a more precise & complete authoritative resource. – philipxy Mar 21 '19 at 22:50
  • Could you point me to any? Again, I would love to find the formal definition I'm looking for, but I have not found what you are telling me to find. And to clarify, my example was JSON (not Java). My question is specific to the field of data parsing/processing, but is language-agnostic. – Jacob Ford Mar 22 '19 at 02:32
  • See my edited answer. (It already had links re canonical/normal forms.) The usage in CS is just the usage in math. http://mathworld.wolfram.com/CanonicalForm.html http://mathworld.wolfram.com/NormalForm.html – philipxy Mar 22 '19 at 04:28

2 Answers


"Canonicalize" & "normalize" (from "canonical (form)" & "normal form") are two related general mathematical terms that also have particular uses in particular contexts per some exact meaning given there. It is reasonable to label a particular process by one of those terms when the general meaning applies.

Your characterizations of those specific uses are fuzzy. The formal meanings for general & particular cases are more useful.

Sometimes given a bunch of things we partition them (all) into (disjoint) groups, aka equivalence classes, of ones that we consider to be in some particular sense similar or the same, aka equivalent. The members of a group/class are the same/equivalent according to some particular equivalence relation.

We pick a particular member as the representative thing from each group/class & call it the canonical form for that group & its members. Two things are equivalent exactly when they are in the same equivalence class. Two things are equivalent exactly when their canonical forms are equal.

A normal form might be a canonical form or just one of several distinguished members.

To canonicalize/normalize is to find or use a canonical/normal form of a thing.
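
A hypothetical illustration of these definitions (my example, not the answerer's): take strings and call two strings equivalent when one is an anagram of the other. Sorting the letters picks exactly one representative per equivalence class, so the sorted string is a canonical form:

```python
# Anagram equivalence: two strings are equivalent iff one is a
# permutation of the other. The alphabetically sorted string is a
# distinguished member of each class, i.e. a canonical form.
def canonical_form(s: str) -> str:
    return "".join(sorted(s))

# Two things are equivalent exactly when their canonical forms are equal:
assert canonical_form("listen") == canonical_form("silent")
assert canonical_form("listen") != canonical_form("listens")
```

Reverse-alphabetical sorting would work just as well; the choice of which member represents the class is arbitrary, but it must be made consistently.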

Canonical form.

The distinction between "canonical" and "normal" forms varies by subfield. In most fields, a canonical form specifies a unique representation for every object, while a normal form simply specifies its form, without the requirement of uniqueness.

Applying the definition to your example: do you have a bunch of values that you are partitioning, and are you picking some member(s) from each class in place of the other members of that class? Well, you have JSON values, and short of re-modeling them you are partitioning them according to which same-class member they map to under a function. So you can reasonably call the resulting JSON values canonical forms of the inputs. If you characterize the re-modeling as applicable to all inputs, then you can also reasonably call the post-re-modeling form of those canonical values canonical forms of the re-modeled input values. But if not, then people probably won't complain if you call the re-modeled values canonical forms of the input values, even though technically they wouldn't be.

philipxy
  • Thank you @philipxy for the thorough answer. I think it can be trimmed while still having its citations—the concise answer being: **_Canonicalization_ is the conversion of data to a _specific_ normal, contextually equivalent form (say, alphabetizing characters in an anagram). _Normalization_ is the conversion of data to _any_ normal form (alphabetizing, or reverse alphabetizing, or stripping non-ASCII characters).** They're _often_ interchangeable, simply because one could say _canonicalization_ without specifying a canonical form. Accurate? Would you prefer I edit yours or post a new answer? – Jacob Ford Mar 22 '19 at 18:48
  • Your phrasing is fuzzy. Since it uses "conversion" & "normal" and "equivalent", it begs the question. Also "specific" & "contextually" & "form" aren't definite either. (Notice too that you seemed to think that an example was needed.) Also the terms apply not just to "data" but to any values/objects/things. I have given the standard mathematical meanings. I can edit in a TL;DR later. – philipxy Mar 22 '19 at 19:02

Consider a set of objects, each of which can have multiple representations. From your example, that would be the set of JSON objects and the fact that each object has multiple valid representations, e.g., each with a different permutation of its members, more or less whitespace, etc.

Canonicalization is the process of converting any representation of a given object to one and only one, unique per object, representation (a.k.a, canonical form). To test whether two representations are of the same object, it suffices to test equality on their canonical forms, see also wikipedia's definition.

Normalization is the process of converting any representation of a given object to a set of representations (a.k.a., "normal forms") that is unique per object. In such case, equality between two representations is achieved by "subtracting" their normal forms and comparing the result with a normal form of "zero" (typically a trivial comparison). Normalization may be a better option when canonical forms are difficult to implement consistently, e.g., because they depend on arbitrary choices (like ordering of variables).
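
A hypothetical sketch of this distinction using fractions (my illustration, not the answerer's): reducing to lowest terms with a positive denominator gives a canonical form, while the "subtract and compare with zero" test (here, cross-multiplication) decides equality without ever reducing:

```python
from math import gcd

# Canonicalization: one unique representation per rational number,
# reduced to lowest terms with a positive denominator.
def canonical(num: int, den: int) -> tuple[int, int]:
    g = gcd(num, den)
    if den < 0:
        g = -g  # flip sign so the denominator comes out positive
    return (num // g, den // g)

# Equality via canonical forms:
assert canonical(2, 4) == canonical(-3, -6) == (1, 2)

# Equality without canonicalizing: "subtract" the two representations
# and compare the result with zero (cross-multiplication).
def equal(a: tuple[int, int], b: tuple[int, int]) -> bool:
    return a[0] * b[1] - b[0] * a[1] == 0

assert equal((2, 4), (-3, -6))
```

The second test sidesteps the arbitrary choices (sign placement, reduction) that the canonical form has to pin down, which is exactly the trade-off the paragraph above describes.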

Section 1.2 of the "A=B" book has some really good examples of both concepts.

Panos