6

I am using librets to retrieve data from my RETS server. Somehow the librets encoding method is not working, and I am receiving some weird characters in my output. I noticed that characters like '’' are replaced with 'â€™'. I have been unable to find a fix for librets, so I decided to replace such garbage characters with their actual values after downloading the data. What I need is a list of such garbage strings and their equivalent characters. I googled for this but did not find any resource. Can anyone point me to a list of such garbage letters and their actual values, or to a piece of code that can generate such a list?

Thanks

ZafarYousafi

2 Answers

11

Search for the term "UTF-8", because that's what you're seeing.

UTF-8 is a way of representing Unicode characters as a sequence of bytes. ("Unicode characters" are the full range of letters and symbols used in all human languages.) One Unicode character becomes 1 to 4 bytes in UTF-8, most commonly 1, 2, or 3. When those bytes (numbers from 0 to 255) are displayed using the character set normally used by Windows, they appear as "garbage" -- in this case, 3 "garbage letters" which are really the 3 bytes of a UTF-8 encoding.
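To see the byte counts for yourself, here is a minimal Python sketch. The sample characters and the helper name `utf8_byte_values` are my own choices for illustration; only `str.encode` is a real library call.

```python
# Sample characters that need 1, 2, 3 and 4 bytes in UTF-8.
samples = ["A", "\u00e9", "\u2019", "\U0001d11e"]  # A, e-acute, smart quote, musical G clef


def utf8_byte_values(ch):
    """Return the UTF-8 encoding of one character as a list of numbers (0-255)."""
    return list(ch.encode("utf-8"))


for ch in samples:
    values = utf8_byte_values(ch)
    print(f"U+{ord(ch):04X} takes {len(values)} byte(s): {values}")
```

Any byte above 127 in those lists is one that a single-byte character set such as CP-1252 will show as a separate "garbage letter".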

In your example, you started with the smart quote character ’. Its representation in Unicode is the number 8217, or U+2019 (2019 is the hexadecimal for 8217). (Search for "Unicode" for a complete list of Unicode characters and their numbers.) The UTF-8 representation of the number 8217 is the three-byte sequence 226, 128, 153. And when you display those three bytes as characters, using the Windows "CP-1252" character encoding (the ordinary way of displaying text on Windows in the USA), they appear as â€™. (Search for "CP-1252" to see a table of bytes and characters.)
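The same walk-through can be reproduced in a few lines of Python. This is only a sketch of the mechanism described above, using Python's built-in `utf-8` and `cp1252` codecs; the variable names are mine.

```python
smart_quote = "\u2019"                    # the right single quotation mark
code_point = ord(smart_quote)             # its Unicode number
utf8_bytes = smart_quote.encode("utf-8")  # how it is stored as UTF-8
garbage = utf8_bytes.decode("cp1252")     # the same bytes misread as CP-1252

print(code_point, hex(code_point))
print(list(utf8_bytes))
print(garbage)

# Undoing the misreading recovers the original character.
repaired = garbage.encode("cp1252").decode("utf-8")
print(repaired == smart_quote)
```

The last two lines suggest that, as long as the text was only misread and not otherwise altered, the damage is reversible without any lookup list.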

I don't have any list for you. But you could make one if you wrote a program in a language that has built-in support for Unicode and UTF-8. All I can do is explain what you are seeing.
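As a rough sketch of such a program, assuming the garbage comes from UTF-8 bytes being read as CP-1252, something like the following Python would build the list. The function names (`mojibake`, `build_table`, `repair`) and the two code point ranges are my own assumptions; adjust the ranges to the characters your data actually contains.

```python
def mojibake(ch):
    """Return how `ch` looks when its UTF-8 bytes are read as CP-1252, or None."""
    try:
        return ch.encode("utf-8").decode("cp1252")
    except UnicodeError:
        # Bytes 0x81, 0x8D, 0x8F, 0x90 and 0x9D have no character in Python's
        # cp1252 codec, so a few characters cannot be listed this way.
        return None


def build_table(start, end):
    """Map each garbage string to its real character for code points start..end."""
    table = {}
    for cp in range(start, end + 1):
        bad = mojibake(chr(cp))
        if bad is not None:
            table[bad] = chr(cp)
    return table


def repair(text, table):
    """Replace every known garbage string in `text` with its real character."""
    # Longest garbage strings first, so a short one never eats part of a long one.
    for bad in sorted(table, key=len, reverse=True):
        text = text.replace(bad, table[bad])
    return text


# Assumed ranges: Latin-1 letters and the general punctuation block (smart quotes, dashes).
table = build_table(0x00A0, 0x00FF)
table.update(build_table(0x2010, 0x203A))

for bad, good in list(table.items())[:5]:
    print(repr(bad), "->", repr(good))
```

Usage would be `clean = repair(downloaded_text, table)`. Be aware that a blind replace like this will also change text that legitimately contains one of the garbage sequences.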

If there is a way to tell librets to use UTF-8 when downloading, that might automatically solve your problem. I don't know anything about librets, but now that you know the term "UTF-8" you might be able to make progress.

librik
  • +1 Concur and emphasize: Most likely the server is doing the right thing, and you need to adapt either your code or your tools. In the simplest case, maybe all that is needed is to configure the viewing tool you are using to inspect the results to display UTF-8 instead of CP-1252 or ISO-8859-1 or whatever. – tripleee Aug 19 '12 at 07:20
  • Agreed. If you are viewing the output using a web browser like Internet Explorer, you can change the character set with a simple menu option. (Go to "View" and pick "Encoding", then change from "Western European (Windows)" to "UTF-8". You may also want to turn off the "Auto-Select" option.) When the encoding is set to UTF-8, Internet Explorer will take the 3 "garbage characters", interpret them as UTF-8 bytes, convert them back into a Unicode character, and display the Unicode character. In this case, librets does not need to change; you just change the way you view the output. – librik Aug 19 '12 at 09:24
  • Thanks for the clarification. librets provides a method to override the encoding, but it seems it does not work. The librets forum is pathetic: they don't allow anyone to post questions until they approve the registered user, and they have not approved me even after a month. – ZafarYousafi Aug 19 '12 at 15:01
0

Question reminder:

"...I noticed characters like '’' is replaced with â€™... I decided to replace such garbage characters with actual values after downloading data. What I need is a list of such garbage strings and their equivalent characters."

Strictly dealing with this part:

"What I need is a list of such garbage strings and their equivalent characters."

Using PHP, you can generate these characters and their equivalents. Working with all 1,111,998 possible Unicode code points, or even the 109,449 assigned characters, is impractical. You can run the following loop over the range just above ASCII, from &#128 to &#257, or over another range that is more relevant to your context.

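<?php
  // One table row per code point: the numeric entity, the "garbage"
  // (its raw UTF-8 bytes), and the intended symbol.
  $tmp1 = ""; // initialise before appending to avoid an undefined-variable notice
  for ($i = 128; $i < 258; $i++) {
    $tmp1 .= "<tr><td>".htmlentities("&#$i;")."</td><td>".html_entity_decode("&#".$i.";",ENT_NOQUOTES,"utf-8")."</td><td>&#".$i.";</td></tr>";
  }

  echo "<table border=1>
    <tr><td>&#</td><td>&quot;Garbage&quot;</td><td>symbol</td></tr>";
  echo $tmp1;
  echo "</table>";
?>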

From experience, in text that is mostly ASCII, most "garbage" symbols originate in the range &#128 to &#257 and, more rarely, &#8129 to &#8246.

In order for the "garbage" symbols to display, the HTML page charset must be set to ISO-8859-1 or whichever other charset caused the problem in the first place. They will not show if the charset is set to utf-8.

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />

"I decided to replace such garbage characters with actual values after downloading data"

You CANNOT undo the "garbage" with PHP's utf8_decode(), which would actually create more "garbage" on top of the existing "garbage". But you can use PHP's simple and fast search-and-replace function, str_replace().

First, generate 2 arrays for each set of "garbage" symbols you wish to replace. The first array holds the search terms:

<?php
  // Search array: the "garbage" form of each code point from 128 to 257,
  // which covers the ISO 8859-1 (Latin-1) special characters.
  $tmp1 = "\$SearchArr = array(";
  for ($i = 128; $i < 258; $i++) {
    $tmp1 .= "\"".html_entity_decode("&#".$i.";",ENT_NOQUOTES,"utf-8")."\", ";
  }
  $tmp1 = substr($tmp1, 0, -2); // drop the trailing ", "
  $tmp1 .= ");";
  // Escape the result so it displays as text in the HTML page.
  $tmp1 = htmlentities($tmp1,ENT_NOQUOTES,"utf-8");
?>

The second array holds the replacement terms:

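<?php
  // Replace array: the numeric HTML entity for each code point.
  // Adapt for your relevant range; it must match the search array's range.
  $tmp2 = "\$ReplaceArr = array(";
  for ($i = 128; $i < 258; $i++) {
    $tmp2 .= "\"&#".$i.";\", ";
  }
  $tmp2 = substr($tmp2, 0, -2); // drop the trailing ", "
  $tmp2 .= ");";

  // Print both array definitions so they can be copied into your own code.
  echo $tmp1."\n<br><br>\n";
  echo $tmp2."\n";
?>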

Now you have 2 arrays that you can copy, paste, and reuse to clean any of your infected strings like this:

$InfectedString = str_replace($SearchArr,$ReplaceArr,$InfectedString);

Note: utf8_decode() is of no help for cleaning up "garbage" symbols, but it can be used to prevent further contamination. Alternatively, an mb_* function can be useful.

RaulentRoi