
How can I normalize a list of function arguments to a string, such that two argument lists convert to the same string iff they are effectively equivalent? The algorithm should

  1. Compare embedded hashes and lists deeply, rather than by reference
  2. Ignore hash key order
  3. Ignore the difference between 3 and "3"
  4. Generate a relatively readable string (not required, but nice-to-have for debugging)
  5. Perform well (XS preferred over Perl)

This is necessary for memoization, i.e. caching the result of the function based on its arguments.

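As a strawman example, Memoize uses this as its default normalizer, which fails #1 and #2 (references are stringified by address, and a plain join neither looks inside hashes nor reorders key/value pairs):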

$argstr = join chr(28),@_;  
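
A minimal sketch of why that default falls short on deep comparison. The names `default_key`, `$list_a`, and `$list_b` are made up for illustration; only the `join chr(28), @_` expression comes from the snippet above.

```perl
use strict;
use warnings;

# Same expression as the default normalizer: join the stringified arguments.
sub default_key { return join chr(28), @_ }

# Two distinct array references with identical contents.
# Both are kept alive so they cannot share an address.
my $list_a = [ 1, 2 ];
my $list_b = [ 1, 2 ];

my $key_a = default_key( 'x', $list_a );
my $key_b = default_key( 'x', $list_b );

# References stringify as ARRAY(0x...) addresses, so the keys differ
# even though the contents are equal.
print $key_a eq $key_b ? "same key\n" : "different keys\n";
```

Because the key depends on memory addresses, equal nested data never produces a cache hit unless the very same reference is passed again.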

For a while my go-to normalizer was

JSON::XS->new->utf8->canonical

However it treats the number 3 and the string "3" differently, based on how the scalar was used recently. This can generate different strings for essentially equivalent argument lists and reduce the memoization benefit. (The vast majority of functions won't know or care if they get 3 or "3".)
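
A small sketch of both halves of that behavior. It uses JSON::PP, the pure-Perl module shipped with Perl, as a stand-in so the example runs without extra installs; that substitution is an assumption, though the serializer comparison in this question reports the same result for both modules.

```perl
use strict;
use warnings;
use JSON::PP;    # stand-in for JSON::XS (assumption, see above)

my $json = JSON::PP->new->utf8->canonical;

# canonical() sorts hash keys, so key order does not affect the output.
my $h1 = $json->encode( [ { a => 1, b => 2 } ] );
my $h2 = $json->encode( [ { b => 2, a => 1 } ] );

# A numeric literal and a string literal are encoded differently.
my $num = $json->encode( [3] );
my $str = $json->encode( ['3'] );

print "$h1 vs $h2\n";
print "$num vs $str\n";
```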

For fun I looked at a bunch of serializers to see which ones differentiate 3 and "3":

Data::Dump   : equal - [3] vs [3]
Data::Dumper : not equal - [3] vs ['3']
FreezeThaw   : equal - FrT;@1|@1|$1|3 vs FrT;@1|@1|$1|3
JSON::PP     : not equal - [3] vs ["3"]
JSON::XS     : not equal - [3] vs ["3"]
Storable     : not equal - <unprintable>
YAML         : equal - ---\n- 3\n vs ---\n- 3\n
YAML::Syck   : equal - --- \n- 3\n vs --- \n- 3\n
YAML::XS     : not equal - ---\n- 3\n vs ---\n- '3'\n

Of the ones that report "equal", I'm not sure how to get them to ignore hash key order.

I could walk the argument list beforehand and stringify all numbers, but this would require making a deep copy and would violate #5.
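
One way to avoid the deep copy, sketched under assumptions, is to build the key string directly during a single walk instead of normalizing first and serializing afterwards. The helpers `args_key`, `_node_key`, and `_quote` are hypothetical names. This is pure Perl, so it does not satisfy #5, and it does not handle cyclic structures or objects.

```perl
use strict;
use warnings;

# Hypothetical helper: build a key string in one pass, with no deep copy.
sub args_key {
    return join ',', map { _node_key($_) } @_;
}

sub _node_key {
    my ($node) = @_;
    return 'undef' unless defined $node;
    my $type = ref $node;
    if ( $type eq 'ARRAY' ) {
        return '[' . join( ',', map { _node_key($_) } @$node ) . ']';
    }
    if ( $type eq 'HASH' ) {
        # Sort keys so that hash ordering never affects the result.
        my @pairs = map { _quote($_) . ':' . _node_key( $node->{$_} ) }
            sort keys %$node;
        return '{' . join( ',', @pairs ) . '}';
    }
    # Everything else is stringified, so 3 and "3" look identical.
    # Other reference types (code refs, objects) still stringify by address.
    return _quote("$node");
}

# Quote a string, escaping embedded quotes and backslashes.
sub _quote {
    my ($text) = @_;
    $text =~ s/(["\\])/\\$1/g;
    return qq{"$text"};
}

print args_key( 3, { b => [ 1, '2' ], a => undef } ), "\n";
```

Since every leaf is emitted as a quoted string, the number/string distinction disappears without touching the caller's data.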

Thanks!

Jonathan Swartz
  • There's also [Test::More](http://metacpan.org/module/Test::More)'s is_deeply, and [Test::Deep](http://metacpan.org/module/Test::Deep)'s eq_deeply. – Ether Jul 24 '12 at 21:56

2 Answers


Pretty much any serializer will treat 3 and "3" differently, because it has no way of knowing that a number and its stringified form are the same for you, and that assumption is false for general data. You must normalize either the input or the output yourself.

For input, a deep scan that replaces every stringified number with its value + 0 will do. If you know exactly where numbers can appear in the input, you can shorten this scan considerably.
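
A rough sketch of such an input scan, assuming plain nested arrays and hashes. `numify_deep` and `scanned_key` are hypothetical names, and JSON::PP (core) is used for the serialization step only so the example runs as-is. Note that numifying also collapses strings such as "0333" and "333", which may or may not be what you want.

```perl
use strict;
use warnings;
use JSON::PP;
use Scalar::Util qw(looks_like_number);

# Hypothetical helper: copy the structure, numifying number-like strings.
sub numify_deep {
    my ($node) = @_;
    my $type = ref $node;
    if ( $type eq 'ARRAY' ) {
        return [ map { numify_deep($_) } @$node ];
    }
    if ( $type eq 'HASH' ) {
        my %copy = map { $_ => numify_deep( $node->{$_} ) } keys %$node;
        return \%copy;
    }
    # Plain scalar that looks like a number: force numeric context.
    return $node + 0 if defined $node && !$type && looks_like_number($node);
    return $node;
}

my $json = JSON::PP->new->canonical;

# Hypothetical helper: normalize the input, then serialize with sorted keys.
sub scanned_key {
    return $json->encode( numify_deep( [@_] ) );
}

print scanned_key( '3', { a => 4 } ), "\n";
```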

For output, a simple state machine or even a regexp (yes, I know the output is not regular) will most likely be enough to turn number-only string values into numbers.
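
A sketch of the regexp variant, again using JSON::PP (core) as an assumed serializer. `stripped_key` is a hypothetical name, and the pattern is deliberately simple: it only recognizes plain integers and decimals, and it is not a full JSON tokenizer, so treat it as a starting point rather than a robust solution.

```perl
use strict;
use warnings;
use JSON::PP;

my $json = JSON::PP->new->canonical;

# Hypothetical helper: serialize, then unquote strings that hold only a number.
sub stripped_key {
    my $text = $json->encode( [@_] );
    # The lookbehind skips a quote that is escaped inside a longer string.
    $text =~ s/(?<!\\)"(-?(?:0|[1-9]\d*)(?:\.\d+)?)"/$1/g;
    return $text;
}

print stripped_key( '3', { a => '4' } ), "\n";
```

The result is no longer guaranteed to be valid JSON (numeric-looking hash keys are unquoted too), which is fine for a cache key but not for data you intend to parse back.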

Oleg V. Volkov
  • Well, no, I list a number of serializers above (like Data::Dump and FreezeThaw) that don't. :) Perhaps you mean "any good serializer *should* treat 3 and "3" differently". I'm not so sure, given the ease and arbitrariness with which Perl values can flop between string and number. – Jonathan Swartz May 31 '12 at 16:31
  • Re scanning, I mentioned that input scanning is undesirable for performance's sake. If it has to be done, I'd want it to be in XS. But it would be a lot more efficient if the serializer had an option to just shut off the distinction. – Jonathan Swartz May 31 '12 at 16:32
  • How about the output scanning I mentioned, then? This should be sufficiently fast. One important plus compared to depending on undocumented quirks is that you can always be sure that manually stripped values will indeed be stripped. – Oleg V. Volkov May 31 '12 at 16:34
  • @JonathanSwartz, well, I'd personally consider this a bug and would file a report if I weren't too lazy. :) So far I think JSON::XS and YAML::XS are the most logical ones. – Oleg V. Volkov May 31 '12 at 16:35
  • @JonathanSwartz Oleg's point is that *your function* could easily treat `3` and `'3'` differently in exactly the same way that JSON::XS and Storable do — and therefore, erasing the distinction when memoizing actually has the potential to make a function malfunction, unless the wrapped function is under a contract to play by certain rules wrt stringification/numification. – hobbs Jun 01 '12 at 00:05
  • @hobbs Can you give me an example of a reasonable function that would treat these differently? e.g. The JSON::XS manual says that given "$x = 5; print $x", $x will be treated as number before the print, and as string after the print. What kind of function would I write that would care about that distinction? See also http://stackoverflow.com/questions/288900/how-can-i-convert-a-string-to-a-number-in-perl - I don't see any answers there saying that it matters whether a scalar is a string or a number. – Jonathan Swartz Jun 01 '12 at 16:17
  • @JonathanSwartz see http://stackoverflow.com/questions/2980550/when-does-the-difference-between-a-string-and-a-number-matter-in-perl-5 – hobbs Jun 01 '12 at 16:29
  • Thanks. I'm still not seeing anything so far that changes how I'd want memoization to work for the functions I write (I don't use bitwise ops and don't care about using "0.00" as boolean), but I acknowledge that it would matter for some applications. Thus it would be ideal IMO if JSON::XS had a flag to affect this behavior. – Jonathan Swartz Jun 01 '12 at 20:49

YAML and its progeny sort hash keys by default. Set $YAML::SortKeys = 2 to get sorting on deep hashes.

Setting $YAML::Stringify to a true value and setting $YAML::XS::QuoteNumericStrings to a false value will help you normalize numeric values. The latter setting will "unquote" a string value that looks like a number.


Also, you can use $Data::Dumper::Sortkeys = 1 to normalize the output order with Data::Dumper. Setting $Data::Dumper::Useqq = 1 will unquote strings that look like numbers.
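
A minimal sketch of the Data::Dumper part, limited to key ordering. `dumper_key` is a hypothetical name, and the `Indent` and `Terse` settings are additions of mine to keep the key compact. The example makes no claim about how `Useqq` treats numeric strings, since that may depend on the Data::Dumper version and on whether the XS or pure-Perl implementation is in use.

```perl
use strict;
use warnings;
use Data::Dumper;

# Hypothetical helper: dump the argument list compactly with sorted keys.
sub dumper_key {
    local $Data::Dumper::Sortkeys = 1;    # sort hash keys at every level
    local $Data::Dumper::Indent   = 0;    # single line, no indentation
    local $Data::Dumper::Terse    = 1;    # omit the "$VAR1 = " prefix
    return Dumper( [@_] );
}

print dumper_key( { b => { d => 2, c => 3 }, a => 1 } ), "\n";
```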

mob
  • Sorry, but no, YAML::XS will behave just as any serializer should. Try `perl -MYAML::XS -e 'my $v = "0333"; print YAML::XS::Dump $v; $v + 0; print YAML::XS::Dump $v; print "$v\n";'` – Oleg V. Volkov May 31 '12 at 16:03
  • @Oleg V. Volkov - thanks to your comment I learned more about what `$YAML::XS::QuoteNumericStrings` is for and edited my answer. But I would think that `"0333"` and `"333"` and `0333` (i.e. 219) should be treated as different inputs by the OP. – mob May 31 '12 at 16:22
  • Regarding the rest, YAML just stripped the string regardless of ::Stringify, and for both YAML and Dumper I find depending on undocumented behavior a disturbing idea. – Oleg V. Volkov May 31 '12 at 16:24