use HTML::Entities;
$output = encode_entities($str);

Sometimes the above function generates &#xFFFD;, which does not display in a browser, like this: �

What is the problem and how can I get entities like this to display properly?

Would encode_entities_numeric be better instead?

EDIT: Should I use the following instead?

use HTML::Entities;
utf8::decode($str);
$output = encode_entities($str);
CJ7
  • Too broad. It's impossible to say without additional information. – Matt Jacob Mar 21 '17 at 22:35
  • @MattJacob What do you mean? What additional information do you need? – CJ7 Mar 21 '17 at 22:41
  • For example... What font are you using? What browser are you using? What do you mean by "does not display"? – Matt Jacob Mar 21 '17 at 22:47
  • No font. Firefox and IE. It displays a little box instead. Try putting `&#xFFFD;` into an html file and see for yourself what it looks like in a browser. – CJ7 Mar 21 '17 at 22:49
  • What do you mean "no font"? Even the default font is *some* font! – Matt Jacob Mar 21 '17 at 23:01
  • There is no font specified in the html. – CJ7 Mar 21 '17 at 23:05
  • I'm pretty sure that you have some encoding inconsistency. Did you try to decode, e.g., an umlaut `\x{fc}` as UTF-8, or something like that? (Try enabling _CHECK_ in `Encode::(de|en)code`.) – clt60 Mar 21 '17 at 23:19
  • @jm666 I think you are right. If I do `decode('ISO-8859-1')` instead of `utf-8` it seems to be fine. How can you detect the encoding of the source? – CJ7 Mar 22 '17 at 00:20
  • In short: _you can't_. Just imagine how you would differentiate ISO-8859-1 from ISO-8859-2. For the longer answer see: http://stackoverflow.com/q/1970660/632407. :) – clt60 Mar 22 '17 at 00:27
  • Technically, you can [reliably guess](http://stackoverflow.com/q/28681864/589924) if you know it's between iso-8859-1 and UTF-8, but guessing in general is "somewhat harder". (The linked question is about cp1252 and UTF-8, but the same applies to iso-8859-1 and UTF-8.) – ikegami Mar 22 '17 at 06:36
  • @ikegami I somehow missed your linked answer until now. ++ to it. – clt60 Mar 22 '17 at 08:35
  • @ikegami Is it likely to be anything other than `iso-8859-1` and `utf8`? – CJ7 Mar 22 '17 at 21:37
  • If you have to ask, yes. This machine's console uses cp850, and this machine's OS uses your choice of cp1252 or UTF-16le. There are many more than the two encodings you mention in common use. – ikegami Mar 22 '17 at 21:39
  • @ikegami Is `utf8` the main problem? Therefore if I do `utf8::decode($str)` will that solve most of my problems? I am just trying to create a generic function to convert characters in strings to HTML entities where necessary. – CJ7 Mar 22 '17 at 21:59
  • Re "*Is utf8 the main problem?*", No, the problem is surely that your code is buggy. (Note that the encoding's name is "UTF-8". The case isn't important, but the dash is. Without the dash, you are talking about an internal-to-Perl extension of UTF-8.) – ikegami Mar 22 '17 at 22:10
  • @ikegami If you see the edit to my question, that is all my code is. – CJ7 Mar 22 '17 at 23:40
  • Liar. If that was true, `$str` would be `undef`, and `$output` would contain an empty string. You obviously had a value in `$str`, so that's NOT your entire program. You have code that placed `"\x{FFFD}"` in `$str`, and it was either intentional, or due to a bug in your code. – ikegami Mar 22 '17 at 23:42
  • @ikegami What I mean is that if I use the code after "EDIT", will that be the solution to my issue? – CJ7 Mar 22 '17 at 23:47
  • I thought that was clear from my Answer, but I rephrased it to be clearer: If you don't want `encode_entities($str)` to produce `&#xFFFD;`, don't put character `0xFFFD` in `$str`. It was probably added as the result of a character-decoding error (e.g. bad input, or bad handling of the input). You'll need to debug to find the underlying problem. – ikegami Mar 22 '17 at 23:54
  • `utf8::decode($str);` makes absolutely no sense there. `$str` doesn't contain text encoded using UTF-8 if it contained character `0xFFFD`. (It will leave `$str` unchanged and return false to signal an error.) – ikegami Mar 22 '17 at 23:56
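The "reliably guess" idea from the comments can be sketched roughly as follows. This is only a sketch, and it assumes the input is known to be either UTF-8 or ISO-8859-1 and nothing else; `decode_utf8_or_latin1` is a hypothetical helper name, not part of any module.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);
use HTML::Entities qw(encode_entities);

# Hypothetical helper: try a strict UTF-8 decode first, and fall back to
# ISO-8859-1 only when the bytes are not valid UTF-8.
sub decode_utf8_or_latin1 {
    my ($bytes) = @_;
    # FB_CROAK makes decode() die on malformed input instead of
    # substituting U+FFFD; LEAVE_SRC keeps $bytes untouched.
    my $text = eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC) };
    return defined $text ? $text : decode('ISO-8859-1', $bytes);
}

# "für" as UTF-8 bytes, then as ISO-8859-1 bytes.
print encode_entities(decode_utf8_or_latin1("f\xC3\xBCr")), "\n";
print encode_entities(decode_utf8_or_latin1("f\xFCr")), "\n";
```

Both calls should print the same entity-encoded text. If the data can be in any other encoding (cp1252, cp850, UTF-16le, ...), this fallback will silently produce wrong characters, so it is no substitute for knowing the source encoding.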

1 Answer

If encode_entities($str) produces &#xFFFD;, it's because $str contains character 0xFFFD.

So if you don't want encode_entities($str) to produce &#xFFFD;, don't put character 0xFFFD in $str. It was probably added as the result of a character-decoding error (e.g. bad input, or bad handling of the input). You'll need to debug to find the underlying problem.
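One way such a decoding error could arise, as a minimal sketch: the byte string below is a made-up example of ISO-8859-1 input being decoded as UTF-8. Your actual input and the place where it gets decoded are unknown, so treat the variable names and data as assumptions.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);
use HTML::Entities qw(encode_entities);

# Assumed input: "für" encoded as ISO-8859-1. The byte 0xFC is not valid
# UTF-8 on its own.
my $bytes = "f\xFCr";

# Lenient decode (the default): the malformed byte is silently replaced
# with U+FFFD, which encode_entities() then turns into a numeric entity.
my $lenient = decode('UTF-8', $bytes);
print encode_entities($lenient), "\n";    # f&#xFFFD;r

# Strict decode: the same input dies, which points at the real problem.
my $strict = eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC) };
print "decode failed: $@" unless defined $strict;

# Decoding with the encoding the data is really in gives the expected entity.
my $latin1 = decode('ISO-8859-1', $bytes);
print encode_entities($latin1), "\n";     # f&uuml;r
```

Passing FB_CROAK wherever the program decodes its input is a cheap way to find the spot where bad bytes, or a wrong assumption about the encoding, enter the program.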

ikegami