4

The result from the Google+ API has \ufeff appended to the end of every "content" value (I don't know why).

What is the best way to remove this Unicode character from the JSON result? It is producing a '?' in some of the output I am displaying.

Example:

https://developers.google.com/+/api/latest/activities/get#try-it 

enter activity id

z12pvrsoaxqlw5imi22sdd35jwvkglj5204

and click Execute, result will be:

{
 .....
 "object": {
  ......
  "content": "CONTENT OF GOOGLE PLUS POST HERE \ufeff",
  ......

Example PHP code which shows a '?' where the '\ufeff' is:

<?php
$data = json_decode($result_from_google_plus_api, true);
echo $data['object']['content'];
// outputs "CONTENT OF GOOGLE PLUS POST HERE ?"
echo trim($data['object']['content']);
// outputs "CONTENT OF GOOGLE PLUS POST HERE ?"

Or am I going about this the wrong way? Should I be fixing the '?' issue rather than trying to remove the '\ufeff'?
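
To see what is actually in the string, it can help to look at the raw bytes. The snippet below is only a sketch: the `$json` literal is a made-up stand-in for the API response, keeping just the "content" shape from the question. It suggests that `json_decode()` turns the `\ufeff` escape into the three UTF-8 bytes EF BB BF, and that `trim()` leaves them alone because its default character list only covers ASCII whitespace and NUL.

```php
<?php
// Hypothetical stand-in for the API response (not real Google+ output).
$json = '{"object": {"content": "CONTENT OF GOOGLE PLUS POST HERE \ufeff"}}';

$data = json_decode($json, true);
$content = $data['object']['content'];

// The last three bytes of the decoded string are U+FEFF encoded as UTF-8.
$tail = bin2hex(substr($content, -3)); // "efbbbf"

// trim() does not know about U+FEFF, so nothing is removed.
$trimmed = trim($content);
$unchanged = (strlen($trimmed) === strlen($content)); // true

echo $tail, PHP_EOL;
```

Whether those bytes show up as a '?' presumably depends on the encoding the page or terminal is using, which is not shown in the question.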

Giacomo1968
dtbaker
  • It's quite unusual to see a BOM at the end of a string ... – Ja͢ck May 05 '14 at 02:07
  • In general, you can filter all invalid utf-8 characters by using [this answer](http://stackoverflow.com/a/11709412/1338292). – Ja͢ck May 05 '14 at 02:26
  • @Jack except that `\ufeff` is valid UTF-8 and will not be caught by the answer you posted – mark Sep 18 '14 at 12:58

2 Answers

10

In your case, you could use this regexp:

$str = preg_replace('/\x{feff}$/u', '', $str);

That way you can exactly match that code point value and have it removed.
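
As a rough usage sketch, the pattern can be wrapped in a small helper. The name `stripTrailingBom` is made up for illustration, and the sample string is hand-built rather than real API output. The `/u` modifier matters here: it makes PCRE treat both pattern and subject as UTF-8, so `\x{feff}` matches the code point rather than a single byte.

```php
<?php
// Hypothetical helper name; the pattern itself is the one shown above.
function stripTrailingBom(string $str): string
{
    $result = preg_replace('/\x{feff}$/u', '', $str);

    // preg_replace() returns null on error (for example, invalid UTF-8 input),
    // so fall back to the untouched string in that case.
    return $result ?? $str;
}

// U+FEFF written as its UTF-8 bytes, appended after a trailing space.
$withBom = "CONTENT OF GOOGLE PLUS POST HERE \xEF\xBB\xBF";
$clean = stripTrailingBom($withBom);

echo strlen($withBom) - strlen($clean), PHP_EOL; // 3 bytes removed
```

Note that the trailing space before the BOM is kept; only the BOM itself is removed.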

In my experience there are a lot more whitespace-like characters you will want to remove. This works well for me:

# I like to call this unicodeTrim()
$str = preg_replace(
  '/
    ^
    [\pZ\p{Cc}\x{feff}]+
    |
    [\pZ\p{Cc}\x{feff}]+$
   /ux',
  '',
  $str
);

I found http://www.regular-expressions.info/unicode.html a pretty good resource about the fine details:

  • \pZ - match any kind of whitespace or invisible separator
  • \p{Cc} - match control characters
  • \x{feff} - match BOM
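
Putting the three classes together, here is a minimal sketch of the trim wrapped in a function, using the `unicodeTrim` name mentioned above. The test string is invented for illustration and mixes a no-break space, a tab, a BOM and a newline around the text.

```php
<?php
// Sketch of the trim above as a reusable function.
function unicodeTrim(string $str): string
{
    // Strip separators (\pZ), control characters (\p{Cc}) and U+FEFF
    // from the start and the end of the string only.
    $result = preg_replace('/^[\pZ\p{Cc}\x{feff}]+|[\pZ\p{Cc}\x{feff}]+$/u', '', $str);

    // preg_replace() returns null on error (e.g. invalid UTF-8); keep the input then.
    return $result ?? $str;
}

// Leading: NO-BREAK SPACE (U+00A0), tab, space. Trailing: space, U+FEFF, newline.
$input = "\xC2\xA0\t hello world \xEF\xBB\xBF\n";
$output = unicodeTrim($input);

echo $output, PHP_EOL; // hello world
```

The space between the two words is kept, since the pattern is anchored to the ends of the string.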

I've seen regexes that suggest matching \pC instead of \p{Cc}. However, this is dangerous because \pC also includes any code point to which no character has been assigned. I've had actual data (certain emojis and other characters) removed because of this.
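
A small sketch of how much broader \pC is than \p{Cc}. Which code points count as unassigned depends on the Unicode tables your PCRE build ships with, so this example uses a private-use code point (U+E000, category Co) as a stable stand-in instead of an emoji; that substitution is my assumption, not something from the data mentioned above.

```php
<?php
$privateUse = "\xEE\x80\x80"; // U+E000, general category Co (private use)
$bom        = "\xEF\xBB\xBF"; // U+FEFF, general category Cf (format)
$tab        = "\t";           // U+0009, general category Cc (control)

// \pC covers every "Other" category: Cc, Cf, Cn, Co and Cs.
$broadPrivate = preg_match('/^\pC$/u', $privateUse); // 1
$broadBom     = preg_match('/^\pC$/u', $bom);        // 1

// \p{Cc} covers control characters only.
$narrowPrivate = preg_match('/^\p{Cc}$/u', $privateUse); // 0
$narrowBom     = preg_match('/^\p{Cc}$/u', $bom);        // 0
$narrowTab     = preg_match('/^\p{Cc}$/u', $tab);        // 1

var_dump($broadPrivate, $narrowPrivate);
```

This also shows why \x{feff} is listed explicitly in the character class: the BOM is a format character, so \p{Cc} on its own does not match it.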

But YMMV, I can't stress this enough.

mark
  • Thanks mark! I'm a few weeks off getting back to this project, once I do I'll implement this regex and let you know how it goes :) cheers! – dtbaker Sep 21 '14 at 00:42
1

With respect to all the answers, I tested most of them but finally found a solution here: GitHub
$field = preg_replace('/[\x00-\x1F\x80-\xFF]/', '', $field);
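
One caveat worth checking before using this pattern: without the `/u` modifier it works on single bytes, so the `\x80-\xFF` range removes every byte of every non-ASCII character, not only the BOM, and `\x00-\x1F` also removes tabs and newlines. A quick sketch with a made-up sample string:

```php
<?php
// "Café" followed by a space and U+FEFF, written as UTF-8 bytes.
$field = "Caf\xC3\xA9 \xEF\xBB\xBF";

// Same pattern as above: byte-based, no /u modifier.
$stripped = preg_replace('/[\x00-\x1F\x80-\xFF]/', '', $field);

echo $stripped, PHP_EOL; // "Caf " (the BOM is gone, but so is the "é")
```

If the content is plain ASCII this may be fine, but for posts containing accented letters, non-Latin scripts or emoji it will likely remove more than intended.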
Eyni Kave