determining soundex conversion

Question

When converting the name 'Lukasieicz' to Soundex (LETTER,DIGIT,DIGIT,DIGIT,DIGIT), I come up with L2222.

However, my lecture slides say the answer is supposed to be L2220.

Please explain why my answer is incorrect, or whether the lecture answer is just a typo.

My steps:

Lukasieicz

Remove and keep L

ukasieicz

Remove contiguous duplicate characters

ukasieicz

Remove A, E, H, I, O, U, W, Y

KSCZ

Convert up to the first four remaining letters to Soundex digits (as described in the lecture directions)

2222

Prepend the beginning letter

L2222

Are you applying the "side-by-side" rule (see [here](http://stackoverflow.com/q/1626217/168657))? — mob, Oct 16 '15 at 20:05
I figured out how you can get `L2220` and updated my answer. — Schwern, Oct 16 '15 at 20:44

score 2 · Answer 1 · edited Jun 20 '20 at 09:12

If this is American Soundex as defined by the National Archives, you're both wrong. American Soundex consists of one letter and three digits, so you can't have L2222 or L2220. It's L222.

But let's say they added another number for some reason.

The basic substitution gives L2222. But you're supposed to collapse adjacent letters with the same numbers (step 3 below) and then pad with zeros if necessary (step 4).

3. If two or more letters with the same number are adjacent in the original name (before step 1), only retain the first letter; also two letters with the same number separated by 'h' or 'w' are coded as a single number, whereas such letters separated by a vowel are coded twice. This rule also applies to the first letter.

4. If you have too few letters in your word that you can't assign [four] numbers, append with zeros until there are [four] numbers. If you have more than [4] letters, just retain the first [4] numbers.

    Lukasieicz    # the original word
    L_2_2___22    # replace with numbers, leave the gaps in
    L_2_2___2     # apply step 3 and squeeze adjacent numbers
    L2220         # apply step 4 and pad to four numbers
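
The squeeze-and-pad procedure above can be sketched in Python. This is a minimal illustration, not a reference implementation: the function name `soundex` and the `digits` parameter (3 for standard American Soundex, 4 for the lecture's variant) are names chosen here for the example.

```python
# Letter-to-digit table used by American Soundex.
CODES = {}
for digit, letters in enumerate(("bfpv", "cgjkqsxz", "dt", "l", "mn", "r"), start=1):
    for ch in letters:
        CODES[ch] = str(digit)


def soundex(name: str, digits: int = 3) -> str:
    """Sketch of American Soundex with a configurable number of digits."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    out = []
    # Start from the first letter's code so the squeeze rule applies to it too.
    prev = CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # 'h' and 'w' do not break a run of equal codes
        code = CODES.get(ch, "")  # vowels and 'y' have no code
        if code and code != prev:
            out.append(code)
        prev = code  # a vowel resets prev, so the next consonant is coded again
    # Truncate to the requested length, then pad with zeros.
    return (first.upper() + "".join(out))[: digits + 1].ljust(digits + 1, "0")


print(soundex("Lukasieicz", digits=4))  # L2220
print(soundex("Lukasieicz"))            # L222
print(soundex("Lukacz"))                # L220
```

The key design point is that vowels are kept in the scan and only reset `prev`, which is what lets `k`, `s` and `c` each count separately while `cz` collapses into one digit.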

We can check how conventional (i.e. three-digit) Soundex implementations behave with the shorter Lukacz, which becomes L_2_22. Following rules 3 and 4, it should be L220.

The National Archives recommends an online Soundex calculator, which produces L220. So do PostgreSQL and Text::Soundex, in both its original and NARA implementations.

    $ perl -wle 'use Text::Soundex; print soundex("Lukacz"); print soundex_nara("Lukacz")'
    L220
    L220

MySQL, predictably, is doing its own thing and returns L200. Its documentation explains:

This function implements the original Soundex algorithm, not the more popular enhanced version (also described by D. Knuth). The difference is that original version discards vowels first and duplicates second, whereas the enhanced version discards duplicates first and vowels second.
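
To see why that ordering yields L200, here is a rough Python sketch of "discard vowels first, duplicates second". It is not MySQL's actual code: the function name `soundex_vowels_first` is hypothetical, and padding or truncating to a fixed length is an assumption made so the output is comparable with the results above.

```python
# Letter-to-digit table used by Soundex.
CODES = {}
for digit, letters in enumerate(("bfpv", "cgjkqsxz", "dt", "l", "mn", "r"), start=1):
    for ch in letters:
        CODES[ch] = str(digit)


def soundex_vowels_first(name: str, digits: int = 3) -> str:
    """Sketch of the ordering: drop uncoded letters first, then collapse duplicates."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    # Step 1: discard every letter without a code (vowels, h, w, y).
    codes = [CODES[c] for c in letters[1:] if c in CODES]
    # Step 2: collapse runs of equal codes. Because the vowels are already
    # gone, consonants that were separated by a vowel now count as adjacent.
    out = []
    prev = CODES.get(first, "")
    for code in codes:
        if code != prev:
            out.append(code)
        prev = code
    # Assumption: fixed-length output, truncated and zero-padded.
    return (first.upper() + "".join(out))[: digits + 1].ljust(digits + 1, "0")


print(soundex_vowels_first("Lukacz"))  # L200
```

With the vowels removed up front, `k`, `c` and `z` form one run of 2s and collapse to a single digit, which matches the L200 result.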

In conclusion, you forgot the squeeze step.
