2

I have a messy batch of JSON files to import into my database. When I opened them in a text editor, I found they contain many odd-looking (seemingly gibberish) escape sequences like:

  • \u00a0

For example, data.json:

[{"title":"hello world!","html_body":"<p>Hello\u00a0 from the\u00a0  other side.\u00a0 <\/p>"}]

And, obviously, the code below simply WON'T work:

$clean = str_replace("\u00a0", "", $string);

Whatever those characters are for, how can I get rid of them?

夏期劇場
  • U+00A0 is a no-break space, not gibberish. It *may* be meaningful and intentional. (Though in this case it may not be.) – deceze Aug 24 '17 at 07:48
  • https://stackoverflow.com/questions/20734771/php-json-remove-every-occurence-of-certain-character-before-another-character – Alive to die - Anant Aug 24 '17 at 07:49
  • @deceze did you recognise this immediately, or did you research what this character (group) was? – Martin Aug 24 '17 at 07:50
  • @Martin OS X's handy character viewer tool… – deceze Aug 24 '17 at 07:50
  • * shakes fist * damn Apple, being better!! `:-D` – Martin Aug 24 '17 at 07:51
  • Are there many other characters in this JSON that need removal, or just this one? – Martin Aug 24 '17 at 07:52
  • @axiac strange that deceze claims it's a non-breaking space; it can't be both.... – Martin Aug 24 '17 at 07:57
  • Guys, please help with a way to get rid of them (regardless of whether these are 'newline characters' or whatever). The data will be used for data mining, so I'd rather it didn't contain any funny characters. Thank you all :))) – 夏期劇場 Aug 24 '17 at 07:58
  • @Martin oops, I misread `0a` when in fact it is `a0`. @deceze is right. It's a non-breaking space https://en.wikipedia.org/wiki/Non-breaking_space – axiac Aug 24 '17 at 08:03
  • You can decode the JSON, remove the undesired character(s) from the strings, then encode them as JSON again, if needed. – axiac Aug 24 '17 at 08:06
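
The decode, clean, re-encode approach suggested in the last comment could be sketched roughly as below. This is only a minimal illustration, not code from the thread: `strip_nbsp` is a hypothetical helper name, the sample string is the `data.json` content from the question, and it assumes the JSON decodes to nested arrays and scalars.

```php
<?php
// Hypothetical helper: recursively remove U+00A0 from every string
// found in data produced by json_decode(..., true).
function strip_nbsp($value)
{
    if (is_string($value)) {
        // "\xc2\xa0" is the UTF-8 byte sequence for U+00A0 (no-break space).
        return str_replace("\xc2\xa0", '', $value);
    }
    if (is_array($value)) {
        // array_map with a single array keeps the original keys.
        return array_map('strip_nbsp', $value);
    }
    return $value; // numbers, booleans and null pass through unchanged
}

// Sample input copied from the question (single quotes keep \u00a0 literal).
$json = '[{"title":"hello world!","html_body":"<p>Hello\u00a0 from the\u00a0  other side.\u00a0 <\/p>"}]';

$data      = json_decode($json, true); // true => associative arrays
$clean     = strip_nbsp($data);
$cleanJson = json_encode($clean);      // re-encode only if JSON is needed again
```

As deceze notes above, the no-break space may be meaningful, so replacing it with an ordinary space instead of an empty string might be the safer choice for some data.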

2 Answers

10

Thanks to everyone in the comment section, who (at least) helped me learn that those are non-breaking spaces. I then googled and found a working solution myself:

$clean_html_body = preg_replace('/\xc2\xa0/', '', $html_body);
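
Why the byte pattern works can be illustrated with a small sketch. It assumes the string has already been passed through `json_decode`; the sample string and the variable names `$decoded` and `$needle` are made up for illustration. The JSON escape `\u00a0` decodes to the two UTF-8 bytes `0xC2 0xA0`, whereas PHP's double-quoted `"\u00a0"` (without braces) is not an escape at all and stays six literal characters, which would explain why the `str_replace` attempt in the question finds nothing in decoded text.

```php
<?php
// json_decode turns the JSON escape \u00a0 into the UTF-8 bytes C2 A0.
$decoded = json_decode('"Hello\u00a0 world"');
$isNbspBytes = ($decoded === "Hello\xc2\xa0 world");

// In PHP source, "\u00a0" is NOT a code point escape (only "\u{00a0}" is,
// and only in PHP 7+), so this needle is the six characters \ u 0 0 a 0.
$needle = "\u00a0";
$needleLength = strlen($needle);
$foundInDecoded = (strpos($decoded, $needle) !== false);

// The pattern from the answer matches the actual bytes in the decoded string.
$clean_html_body = preg_replace('/\xc2\xa0/', '', $decoded);
```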

Thanks again all. :)

夏期劇場
  • The real solution is that the character set of the JSON should be properly detected at creation time and encoded into UTF-8 or similar. But this raises its own problems, since character encoding detection is prone to false positives – Martin Aug 24 '17 at 08:09
  • If he does his own json_encoding, he could try adding JSON_UNESCAPED_UNICODE as an option. – DocWeird Aug 24 '17 at 08:11
  • Yes. But as I mentioned, it was given to me. I wasn't the one who generated the JSON files, so I do not own the original data source, and I needed a solution that works on the files as I received them. (There are also too many files to clean up manually.) – 夏期劇場 Aug 24 '17 at 08:12
1

If you have individual strings that might have non-breaking spaces or line breaks at either end, you can trim these while putting together your JSON data by using this:

$dat = trim($dat," \t\n\r\0\x0B\xc2\xa0");
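
One caveat with the `trim` call above: `trim` treats its character list as individual bytes, so `\xc2` and `\xa0` are stripped independently, and a string ending in a character such as `à` (UTF-8 bytes `C3 A0`) would lose its last byte. A UTF-8-aware variant could look roughly like the sketch below; `trim_nbsp` is a hypothetical helper name, and it assumes the input is valid UTF-8.

```php
<?php
// Hypothetical helper: trim whitespace, NUL and U+00A0 from both ends,
// working on whole UTF-8 characters rather than single bytes.
function trim_nbsp(string $s): string
{
    // The /u modifier makes PCRE treat pattern and subject as UTF-8,
    // so \x{00A0} matches the full character, never a lone 0xA0 byte.
    $result = preg_replace('/\A[\s\x{00A0}\0]+|[\s\x{00A0}\0]+\z/u', '', $s);

    // preg_replace returns null on error (e.g. invalid UTF-8 input);
    // in that case hand back the input unchanged.
    return $result === null ? $s : $result;
}

$padded   = trim_nbsp("  hello\xc2\xa0\n"); // NBSP and newline at the end
$accented = trim_nbsp("voil\xc3\xa0");      // ends in "à", must stay intact
```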