1

From the answers to this question I tried to make my program more safe by converting strings to hex and comparing those values instead of directly and dangerously using strings directly from the user. I modified the code on that question to add a conversion:

function mssql_escape($data) {
    if(is_numeric($data))
        return $data;

    $data = iconv("ISO-8859-1", "UTF-16", $data);

    $unpacked = unpack('H*hex', $data);

    return '0x' . $unpacked['hex'];
}

I do this because in my database I am using nvarchar instead of varchar. Now when I run through it on the php side, it comes up with

0xfeff00680065006c006c006f00200077006f0072006c00640021

Then I run the following query:

 declare @test nvarchar(100);
 set @test = 'hello world!';
 select CONVERT(VARBINARY(MAX), @test);

It results in:

0x680065006C006C006F00200077006F0072006C0064002100

Now you'll notice those numbers are ALMOST the same. Other than the trailing zeros, the only difference is feff00. Why is that there? I realize all I would have to do is shift, but I'd really like to know WHY it's there instead of just making an assumption. Can anybody explain to me why php decides to throw feff00 (yellow!) in the front of my hex?

Community
  • 1
  • 1
Andrew
  • 1,571
  • 17
  • 31

1 Answers1

0

Well, Andrew, I seem to answer a lot of your questions. This link explains:

So the people were forced to come up with the bizarre convention of storing a FE FF at the beginning of every Unicode string; this is called a Unicode Byte Order Mark and if you are swapping your high and low bytes it will look like a FF FE and the person reading your string will know that they have to swap every other byte. Phew. Not every Unicode string in the wild has a byte order mark at the beginning.

And Wikipedia explains:

If the 16-bit units are represented in big-endian byte order, this BOM character will appear in the sequence of bytes as 0xFE followed by 0xFF. This sequence appears as the ISO-8859-1 characters þÿ in a text display that expects the text to be ISO-8859-1.

if the 16-bit units use little-endian order, the sequence of bytes will have 0xFF followed by 0xFE. This sequence appears as the ISO-8859-1 characters ÿþ in a text display that expects the text to be ISO-8859-1.

So the code you displayed with FEFF, which means it's in Big Endian notation. Use UTF-16LE for little endian, and SQL will understand that. Shifting the first SIX hex digits will only coincidentally work as long as you're only using two bytes.

Community
  • 1
  • 1
Andrew
  • 1,571
  • 17
  • 31