4

I'm trying to read ID3 data in bulk. On some of the tracks, ÿþ appears. I can remove the first 2 characters, but that hurts the tracks that don't have it.

This is what I currently have:

$trackartist=str_replace("\0", "", $trackartist1);

Any suggestions would be greatly appreciated, thanks!

austinh
  • 4
    Does `str_replace("ÿþ", "", $trackartist1);` work? – Martin Tournoij Oct 21 '14 at 17:52
  • No, it does not. @Carpetsmoker – austinh Oct 21 '14 at 17:52
  • 2
    Can you provide a sample string of ID3 data? str_replace supports multibyte strings, and @Carpetsmoker's suggestion seems to work: http://codepad.org/Od59V0ki – danronmoon Oct 21 '14 at 17:55
  • 1
    Why doesn't @Carpetsmoker's suggestion work? That would seem to be the answer here. To go further, you can wrap that str_replace in an if statement that first checks whether the string starts with "ÿþ". – Scott Oct 21 '14 at 17:55
  • 2
    Can you post a `var_dump()` of your string to see what it contains exactly? – jeroen Oct 21 '14 at 17:56
  • If I use `$tracktitle=str_replace("ÿþ", "", $tracktitle1);` it doesn't work, but if I use `$tracktitle=str_replace("ÿþ", "", "ÿþChange The Way");` it does work – austinh Oct 21 '14 at 18:05
  • @Carpetsmoker This is the var_dump for one `ÿþAmong The Thirsty` – austinh Oct 21 '14 at 18:13
  • You should solve this at the root. This mark is the UCS-2 file BOM; strip it when reading a UCS-2 file. https://stackoverflow.com/questions/11643500/how-to-read-a-ucs-2-file/64983143#64983143 – Zhang Nov 24 '20 at 09:02

3 Answers

9

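ÿþ is how the bytes 0xFF 0xFE are displayed in a single-byte encoding such as ISO-8859-1; these two bytes are the byte order mark (BOM) of little-endian UTF-16. You can convert your string to UTF-8 with iconv() or mb_convert_encoding():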

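# Option 1: iconv
$trackartist1 = iconv('UTF-16LE', 'UTF-8', $trackartist1);

# Option 2: same conversion with the mbstring extension (use one or the other, not both)
# $trackartist1 = mb_convert_encoding($trackartist1, 'UTF-8', 'UTF-16LE');

# After conversion the BOM, if still present, is the UTF-8 byte sequence EF BB BF.
# Match those bytes; a literal 'ÿþ' in a UTF-8 source file is a different byte sequence.
$trackartist1 = str_replace("\xEF\xBB\xBF", '', $trackartist1);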

This assumes $trackartist1 is always in UTF-16LE; check the documentation of your ID3 tag library on how to get the encoding of the tags, since this may be different for different files. You usually want to convert everything to UTF-8, since this is what PHP uses by default.
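
To convert only the tracks that actually carry the mark, one option is to test the raw bytes 0xFF 0xFE rather than the characters ÿþ. What follows is a minimal sketch, not a definitive implementation: it assumes a value is UTF-16LE whenever it starts with those two bytes and is already usable otherwise, and the function name `tag_to_utf8` is made up for illustration.

```php
<?php
// Hypothetical helper (not part of any ID3 library): convert a tag value to
// UTF-8 only if it starts with the UTF-16LE BOM; otherwise return it unchanged.
function tag_to_utf8(string $raw): string
{
    // strncmp() compares raw bytes, so the source file's encoding does not matter.
    if (strncmp($raw, "\xFF\xFE", 2) !== 0) {
        return $raw; // no BOM: leave the value untouched
    }
    // Drop the two BOM bytes, then convert the remainder from UTF-16LE.
    $converted = iconv('UTF-16LE', 'UTF-8', substr($raw, 2));
    return $converted === false ? $raw : $converted;
}

// "Hi" as UTF-16LE with a BOM, and a plain value without one.
$withBom = "\xFF\xFE" . "H\x00i\x00";
echo tag_to_utf8($withBom), "\n";  // Hi
echo tag_to_utf8('Plain'), "\n";   // Plain
```

Values without the mark pass through untouched, so removing "the first 2 characters" can no longer damage them.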

Martin Tournoij
  • When I use `$trackartist1 = iconv('UTF-8', 'UTF-16', $trackartist1);` and `str_replace('ÿþ', '', $trackartist1);` it then switches to þÿ at the beginning – austinh Oct 21 '14 at 18:29
  • Second one should be `mb_convert_encoding($message, 'UTF-8', 'UTF-16LE')` – n-dru Oct 06 '21 at 09:31
1

I had a similar problem but could not force UTF-16LE because the input charset could change. In the end I detect UTF-8 as follows:

if (!preg_match('~~u', $html)) {

If this check fails (the input is not valid UTF-8), I obtain the correct encoding from the BOM:

function detect_bom_encoding($str) {
    if ($str[0] == chr(0xEF) && $str[1] == chr(0xBB) && $str[2] == chr(0xBF)) {
        return 'UTF-8';
    }
    else if ($str[0] == chr(0x00) && $str[1] == chr(0x00) && $str[2] == chr(0xFE) && $str[3] == chr(0xFF)) {
        return 'UTF-32BE';
    }
    else if ($str[0] == chr(0xFF) && $str[1] == chr(0xFE)) {
        if ($str[2] == chr(0x00) && $str[3] == chr(0x00)) {
            return 'UTF-32LE';
        }
        return 'UTF-16LE';
    }
    else if ($str[0] == chr(0xFE) && $str[1] == chr(0xFF)) {
        return 'UTF-16BE';
    }
    return null; // no known BOM found
}

And now I'm able to use iconv() as shown in @carpetsmoker's answer:

iconv(detect_bom_encoding($html), 'UTF-8', $html);
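
Put together, the steps of this answer might look like the sketch below. It rests on a few assumptions: input that fails the UTF-8 check is expected to carry one of the BOMs listed above, the BOM bytes are cut off before converting so the result does not depend on how iconv() treats them, and the names `bom_info` and `to_utf8` are made up for illustration.

```php
<?php
// Hypothetical helper: return [encoding, BOM length in bytes], or null if
// the string does not start with a known UTF-16/UTF-32 BOM.
function bom_info(string $s): ?array
{
    // Order matters: the UTF-32LE BOM begins with the UTF-16LE BOM bytes.
    $boms = [
        'UTF-32BE' => "\x00\x00\xFE\xFF",
        'UTF-32LE' => "\xFF\xFE\x00\x00",
        'UTF-16LE' => "\xFF\xFE",
        'UTF-16BE' => "\xFE\xFF",
    ];
    foreach ($boms as $encoding => $bom) {
        if (strncmp($s, $bom, strlen($bom)) === 0) {
            return [$encoding, strlen($bom)];
        }
    }
    return null;
}

// Hypothetical wrapper: keep valid UTF-8 as is, otherwise convert based on the BOM.
function to_utf8(string $s): string
{
    // preg_match() with the u modifier returns false for invalid UTF-8.
    if (preg_match('~~u', $s) === 1) {
        // Already UTF-8; only drop a UTF-8 BOM if there is one.
        return strncmp($s, "\xEF\xBB\xBF", 3) === 0 ? substr($s, 3) : $s;
    }
    $info = bom_info($s);
    if ($info === null) {
        return $s; // unknown encoding: return unchanged rather than guess
    }
    [$encoding, $bomLength] = $info;
    $converted = iconv($encoding, 'UTF-8', substr($s, $bomLength));
    return $converted === false ? $s : $converted;
}

echo to_utf8("\xFF\xFE" . "H\x00i\x00"), "\n"; // Hi
echo to_utf8("\xFE\xFF" . "\x00H\x00i"), "\n"; // Hi
```

Returning the input unchanged when no BOM is found is a design choice of this sketch; depending on the data, raising an error may be more appropriate.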

I did not use mb_convert_encoding() because it did not remove the BOM (and did not convert the line breaks as iconv() does).

mgutt
0

Use regex replacement:

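$trackartist1 = preg_replace('/\x00/', '', $trackartist1);

The regex above removes every "\x00" (NUL) byte from the string, not only the first one. The pattern is single-quoted so that PCRE, not PHP, interprets the escape. It does not touch the ÿþ bytes (0xFF 0xFE), so it does not remove the BOM.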
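
A quick way to see what this replacement does to a UTF-16LE value is to dump the bytes before and after. The sample bytes below are made up for illustration; the output shows the NUL bytes disappearing while the two leading bytes FF FE stay.

```php
<?php
// "Hi" encoded as UTF-16LE with a leading BOM (bytes FF FE).
$raw = "\xFF\xFE" . "H\x00i\x00";

// Remove every NUL byte; the pattern is single-quoted so PCRE parses \x00.
$stripped = preg_replace('/\x00/', '', $raw);

echo bin2hex($raw), "\n";      // fffe48006900
echo bin2hex($stripped), "\n"; // fffe4869
```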

Tala
  • @Carpetsmoker my bad! I thought he wants to get `\0` characters out as mentioned in his code. I didn't notice `\xfffe`. – Tala Oct 21 '14 at 18:41