How to remove %EF%BB%BF in a PHP string

Question

I am trying to use the Microsoft Bing API.

$data = file_get_contents("http://api.microsofttranslator.com/V2/Ajax.svc/Speak?appId=APPID&text={$text}&language=ja&format=audio/wav");
$data = stripslashes(trim($data));

The returned data has a ' ' character as the first character of the string. It is not a space, because I trimmed the data before returning it.

The ' ' character turned out to be %EF%BB%BF.

I wonder why this happened, maybe a bug from Microsoft?

How can I remove this %EF%BB%BF in PHP?

score 18 · Answer 1 · edited May 06 '15 at 18:16

18

You should not simply discard the BOM unless you're 100% sure that the stream will: (a) always be UTF-8, and (b) always have a UTF-8 BOM.

The reasons:

In UTF-8, a BOM is optional, so if the service stops sending it at some future point, you'll be throwing away the first three bytes of your response instead.
The whole purpose of the BOM is to identify unambiguously the type of UTF stream being interpreted (UTF-8, UTF-16, or UTF-32), and for the wider encodings also to indicate the 'endianness' (byte order) of the encoded information. If you just throw it away, you're assuming that you're always getting UTF-8; this may not be a very good assumption.
Not all BOMs are three bytes long; only the UTF-8 one is. The UTF-16 BOM is two bytes, and the UTF-32 BOM is four bytes. So if the service switches to a wider UTF encoding in the future, your code will break.

I think a more appropriate way to handle this would be something like:

/* Detect the encoding, then convert from detected encoding to ASCII */
$enc = mb_detect_encoding($data);
$data = mb_convert_encoding($data, "ASCII", $enc);
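One possible way to act on the points above, as a sketch rather than a definitive implementation: look at the leading bytes for a known BOM, strip exactly that many bytes, and convert the rest to UTF-8 (rather than ASCII, which cannot represent Japanese text). The function name `decode_with_bom` and the UTF-8 fallback for input without a BOM are assumptions made for illustration, and the code assumes the mbstring extension is available.

```php
<?php
// Hypothetical helper: detect a BOM, strip it, and return the text as UTF-8.
function decode_with_bom(string $data, string $fallback = 'UTF-8'): string
{
    // Check the 4-byte UTF-32 signatures before the 2-byte UTF-16 ones,
    // because the UTF-32LE BOM begins with the UTF-16LE BOM.
    $boms = [
        "\x00\x00\xFE\xFF" => 'UTF-32BE',
        "\xFF\xFE\x00\x00" => 'UTF-32LE',
        "\xEF\xBB\xBF"     => 'UTF-8',
        "\xFE\xFF"         => 'UTF-16BE',
        "\xFF\xFE"         => 'UTF-16LE',
    ];
    foreach ($boms as $bom => $encoding) {
        if (strncmp($data, $bom, strlen($bom)) === 0) {
            // Remove exactly the BOM bytes, then decode with the encoding it announced.
            $body = (string) substr($data, strlen($bom));
            return mb_convert_encoding($body, 'UTF-8', $encoding);
        }
    }
    // No BOM found: fall back to an assumed encoding (an assumption, not a detection).
    return mb_convert_encoding($data, 'UTF-8', $fallback);
}
```

With this, `decode_with_bom($data)` keeps working whether or not the service sends a BOM, and a switch to UTF-16 would be decoded instead of silently corrupted.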

edited May 06 '15 at 18:16

Peter Mortensen


answered Oct 30 '10 at 08:21

Lee


2

This doesn't appear to work in practice. `mb_convert_encoding("\357\273\277some text", 'ASCII', mb_detect_encoding("\357\273\277some text"))` yields `string(10) "?some text"`. Notice that it left a question mark in the output. – mpen Jan 28 '14 at 19:59
@mark Unfortunately, that does appear to be true. I had better luck using `iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE',"\357\273\277some text")` to do the conversion. I guess `mb_detect_encoding` would be used to detect the initial charset, which would then be passed as the first arg to `iconv`. This is more of a hack than it should be. – Lee Feb 04 '14 at 20:20
1

@mark I had to add the following line to get rid of the ? : ini_set('mbstring.substitute_character', "none"); – naw103 Nov 20 '14 at 22:40

D3F4ULT · Answer 2 · 2013-08-13T21:02:47.953

6

$data = file_get_contents("http://api.microsofttranslator.com/V2/Ajax.svc/Speak?appId=APPID&text={$text}&language=ja&format=audio/wav");
$data = stripslashes(trim($data));

// Strip the UTF-8 BOM only if it is actually present.
if (substr($data, 0, 3) === "\xef\xbb\xbf") {
    $data = substr($data, 3);
}
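The same check can be wrapped in a small reusable function. This is only a sketch, and the name `strip_utf8_bom` is hypothetical; it uses `strncmp` so that strings shorter than three bytes are handled without special cases.

```php
<?php
// Hypothetical helper: remove a leading UTF-8 BOM only if one is present.
function strip_utf8_bom(string $data): string
{
    $bom = "\xEF\xBB\xBF";
    // strncmp() compares just the first three bytes and is binary-safe.
    if (strncmp($data, $bom, 3) === 0) {
        // The cast covers older PHP versions where substr() can return false.
        return (string) substr($data, 3);
    }
    return $data; // no BOM: leave the string untouched
}
```

Calling `strip_utf8_bom($data)` on a response without a BOM returns it unchanged, which is the property that makes the conditional version safer than an unconditional `substr($data, 3)`.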

edited Aug 13 '13 at 21:02

answered Aug 13 '13 at 20:54

D3F4ULT


This solution is great! It helped me a lot when I couldn't find the answer. The best part, logically, is the condition: check whether the BOM hex characters exist, and only then delete them. This code seems future-safe; even if the server stops sending the BOM, this function will still work. +1 – Gkiokan Nov 17 '16 at 23:52

score 2 · Answer 3 · edited May 06 '15 at 18:16

2

It's a byte order mark (BOM), indicating the response is encoded as UTF-8. You can safely remove it, but you should parse the remainder as UTF-8.
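To illustrate what "parse the remainder as UTF-8" might look like in practice, here is a minimal sketch. The function name `utf8_body` and the choice to throw an exception on invalid input are assumptions, and the mbstring extension is assumed to be available.

```php
<?php
// Hypothetical helper: drop a UTF-8 BOM if present, then confirm the rest is valid UTF-8.
function utf8_body(string $data): string
{
    if (substr($data, 0, 3) === "\xEF\xBB\xBF") {
        $data = (string) substr($data, 3);
    }
    // Validate before handing the text to multibyte-aware functions.
    if (!mb_check_encoding($data, 'UTF-8')) {
        throw new UnexpectedValueException('Response is not valid UTF-8');
    }
    return $data;
}

// "\xE6\x97\xA5\xE6\x9C\xAC" is the UTF-8 encoding of two Japanese characters.
$body = utf8_body("\xEF\xBB\xBF\xE6\x97\xA5\xE6\x9C\xAC");
$bytes = strlen($body);             // counts bytes
$chars = mb_strlen($body, 'UTF-8'); // counts characters
```

The difference between `$bytes` and `$chars` is why the remainder should go through the `mb_*` functions with an explicit `'UTF-8'` argument rather than the byte-oriented string functions.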

edited May 06 '15 at 18:16

Peter Mortensen


answered Oct 30 '10 at 07:42

Eric Bowman - abstracto -


score 0 · Answer 4 · answered Jul 09 '13 at 20:54

0

I had the same problem today, and fixed it by ensuring the string was set to UTF-8:

http://php.net/manual/en/function.utf8-encode.php

$content = utf8_encode ( $content );

answered Jul 09 '13 at 20:54

a coder


score -1 · Answer 5 · answered Oct 30 '10 at 07:40

-1

To remove it from the beginning of the string (only):

$data = preg_replace('/^%EF%BB%BF/', '', $data);
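Note that this pattern matches the literal text `%EF%BB%BF`, which is how the three BOM bytes look after URL-encoding. If `$data` holds the raw bytes instead, a pattern with `\x` escapes is needed. The sketch below contrasts the two cases; the variable names are made up for illustration.

```php
<?php
$raw   = "\xEF\xBB\xBFhello";   // raw bytes, as file_get_contents() would return them
$shown = rawurlencode($raw);     // the same data once URL-encoded

$literal = '/^%EF%BB%BF/';       // matches the percent-encoded text
$bytes   = '/^\xEF\xBB\xBF/';    // matches the three raw bytes (no /u modifier)

$encodedClean = preg_replace($literal, '', $shown); // BOM text removed
$rawUntouched = preg_replace($literal, '', $raw);   // nothing matches, unchanged
$rawClean     = preg_replace($bytes, '', $raw);     // raw BOM removed
```

Which pattern applies depends on whether the string was URL-encoded somewhere before this point.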

answered Oct 30 '10 at 07:40

enobrev


score -1 · Answer 6 · answered Oct 30 '10 at 07:42

-1

$data = str_replace('%EF%BB%BF', '', $data);

You probably shouldn't be using stripslashes -- unless the API returns backslashed data (and 99.99% chance it doesn't), take that call out.

answered Oct 30 '10 at 07:42

Amy B


score -3 · Accepted Answer · answered Oct 30 '10 at 07:42

-3

You could use substr to only get the rest without the UTF-8 BOM:

// if it’s binary UTF-8
$data = substr($data, 3);
// if it’s percent-encoded UTF-8
$data = substr($data, 9);

answered Oct 30 '10 at 07:42

Gumbo


Note: generally speaking, throwing away the BOM is not a good idea. The BOM is there to tell you how the rest of the string should be handled. If you just ignore it, assuming that it's a UTF-8 3-byte BOM, you're setting yourself up for some real problems if/when the encoding ever changes. ... Please have a look at my answer below for more details. – Lee Oct 30 '10 at 08:23
2

To future googlers: [use this solution instead](http://stackoverflow.com/a/4057875/457104). Throwing away the BOM is a **bad idea**. – crdx Nov 15 '12 at 16:53
