7

I'm testing a metaphone implementation for C# and comparing its results against the built-in metaphone() function from PHP. However, I've come across a bug (which is previously documented in PHP's issue tracker and discussed on a mailing list), but I'm trying to understand the C code behind their bug for my own personal interest.

Basically, according to the metaphone algorithm, most instances of -gh- should be rendered silent. In the specific test case of "wright", I expect (and generate with my own algorithm) a metaphone key of "RT"

"wr" => R
"i"  => ignored
"gh" => ignored
"t"  => T

Result: RT

However, PHP's metaphone function returns RFT. Clearly, it's converting the -gh- to an F, as if it were at the end of a word (e.g. "rough"), but in the case of the word "wright", this is incorrect, because the -gh- does not come at the end of the word. Looking at the metaphone.c file in the PHP source distribution, I see a few key things:

/* These prevent GH from becoming F */
#define NOGHTOF(c)  (ENCODE(c) & 16)    /* BDH */

...

/* Go N letters back. */
#define Look_Back_Letter(n) (w_idx >= n ? toupper(word[w_idx-n]) : '\0')

And then on line 342:

case 'G':
    if (Next_Letter == 'H') {
        if (!(NOGHTOF(Look_Back_Letter(3)) || Look_Back_Letter(4) == 'H')) {
            Phonize('F');
            skip_letter++;

Can someone help me understand what exactly the NOGHTOF function does and why this code is incorrectly rendering an F for the -gh- in "wright"? I'm not really a C guy, so the code isn't at all clear to me.

Charles
  • 50,943
  • 13
  • 104
  • 142
Chris
  • 27,596
  • 25
  • 124
  • 225

1 Answers1

1

The meaning of NOGHTOF(c) is actually determined by the code starting at line 81:

char _codes[26] = {
        1, 16, 4, 16, 9, 2, 4, 16, 9, 2, 0, 2, 2, 2, 1, 4, 0, 2, 4, 4, 1, 0, 0, 0, 8, 0
    /*  a  b   c  d   e  f  g  h   i  j  k  l  m  n  o  p  q  r  s  t  u  v  w  x  y  z */
};

#define ENCODE(c) (isalpha(c) ? _codes[((toupper(c)) - 'A')] : 0)

Essentially, a value is assigned for each letter of the alphabet in order (A = 1, B = 16, etc.) Then ENCODE macro checks whether the passed character is a letter; if yes, it returns the corresponding code for that letter, otherwise it returns the null character. (It doesn't really return anything, as this is a macro and is substituted by the compiler at compile time to replace the actual call.)

The way I'm reading the code for 'G' is this (without trying to understand why):

If current letter is G then
    If next letter is H then
        Take "_code" value of a letter three letters back (why?) from the _codes table and check the fifth bit (from the back, naturally)
        If this bit is not set OR if a letter four letters back (why?) is 'H' then
            Add 'F' to the result
            skip one more character (letter 'H' following the 'G')

Why it is like this is beyond me though, I'm quite sure somebody had a good reason to write it this way, but it seems an obvious bug to me.

Aleks G
  • 56,435
  • 29
  • 168
  • 265
  • Thanks. I'm only somewhat familiar with bit level operators. Can you tell me how exactly 'AND'ing a number with 16 clears the last 4 bits? – Chris Feb 13 '12 at 21:12
  • First, my mistake, it's not clearing the last 4 bit - it checks whether the fifth bit is set - i'm updating my answer. Now, you're not dealing with any number, but only with one byte (8 bits): xxxxxxxx in binary; 16 in binary is 00010000; now bitwise AND takes corresponding bits of two numbers and create a new number by setting the corresponding bit to 1 only if both bits are 1. – Aleks G Feb 13 '12 at 21:28
  • Right, I got what the & operator does. I figured it was checking to see if bit 5 was set, but was confused by your answer. Thanks for clearing that up. Having said that, yes, I'm also very unsure why checking whether the third letter before the G is ('B','D','H') would render the -gh- silent. Perhaps there the original coder was targeting a select few words in this manner (bough and dough I get, but hough?), but no doubt the code is incorrect / buggy as hell. Thanks for the additional insight. – Chris Feb 13 '12 at 21:38