This should be simple, but I can't figure it out.

The site in question is UTF-8 encoded.

A customer has been having trouble filling out a form on our website. Here is an example of the data they entered:

SPICER-SMITHS LOST

It looks like a regular string, but when you copy it into an app like Notepad++, a "?" appears in the word "SMITHS" ("SMITH?S").

The script sanitizes the field and goes the extra step of removing the following characters: "\r\n", "\n", "\r", "\t", "\0", "\x0B".

It's not catching this hidden character though.

Does anybody know what's going on here?

EDIT: I'm using PHP. Here is the function that I use to sanitize the field:

function strip_hidden_chars($str)
{
    $chars = array("\r\n", "\n", "\r", "\t", "\0", "\x0B");

    $str = str_replace($chars," ",$str);

    return preg_replace('/\s+/',' ',$str);
}

EDIT 2: @thaJeztah led me to the answer. The string I was testing was the output from our support ticket, after the customer had copied and pasted it from whatever application she uses. The actual input contained a curly apostrophe:

SPICER-SMITH’S
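
For anyone landing here with the same symptom, here is a minimal sketch of one way to turn that curly apostrophe (U+2019) and its siblings into plain ASCII quotes. It assumes the field arrives as valid UTF-8 and that PHP 7+ is available for the `\u{...}` escapes; `normalize_quotes` is a made-up name, not part of the original script.

```php
<?php
// Hypothetical helper (not from the original script): maps typographic
// quotes to their ASCII equivalents in a UTF-8 string.
function normalize_quotes(string $str): string
{
    $map = [
        "\u{2018}" => "'", // left single quotation mark
        "\u{2019}" => "'", // right single quotation mark (the character in the customer's input)
        "\u{201C}" => '"', // left double quotation mark
        "\u{201D}" => '"', // right double quotation mark
    ];

    // strtr() with an array replaces whole byte sequences in a single pass,
    // so the multibyte characters are matched as units.
    return strtr($str, $map);
}

echo normalize_quotes("SPICER-SMITH\u{2019}S LOST"), "\n"; // SPICER-SMITH'S LOST
```

Whether to replace the apostrophe or simply accept it depends on what the field is used for; the mapping above is only one option.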

Bill H
  • I figured out it was this character: http://www.fileformat.info/info/unicode/char/92/index.htm I just don't know how to strip it out. – Bill H Feb 01 '13 at 20:29
  • By the way, you can remove control characters with the regex `'/(?=[^\n\r\t])\p{Cc}/u'`. It also handles the lesser-known C1 controls, not just the ASCII ones. – Esailija Feb 02 '13 at 09:22
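
A rough sketch of how the regex from the comment above could be wrapped in a function, assuming PHP 7+ and UTF-8 input. The name `strip_control_chars` is hypothetical. Note that this only removes real control characters such as U+0092; it would not touch a curly apostrophe (U+2019), which is an ordinary punctuation character.

```php
<?php
// Hypothetical helper (not from the original post): removes Unicode control
// characters while leaving \n, \r and \t for the caller to handle.
function strip_control_chars(string $str): string
{
    // \p{Cc} covers the C0 controls, DEL and the C1 controls (U+0080-U+009F).
    // The lookahead skips newline, carriage return and tab.
    $clean = preg_replace('/(?=[^\n\r\t])\p{Cc}/u', '', $str);

    // With the /u flag, preg_replace() returns null if $str is not valid UTF-8;
    // in that case this sketch hands the input back unchanged.
    return $clean ?? $str;
}

// "\u{0092}" is the C1 control character from the linked fileformat.info page.
echo strip_control_chars("SPICER-SMITH\u{0092}S LOST"), "\n"; // SPICER-SMITHS LOST
```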

3 Answers

You could have a look at this related question about removing control characters:

Remove control characters from php String

thaJeztah
  • Thanks for the find. I tried some things in there but it's still not working. – Bill H Feb 01 '13 at 20:43
  • I accepted your answer because it led to me realizing that I was testing the output instead of what was originally inputted. – Bill H Feb 01 '13 at 21:20
  • @BillH can you update your question and add the things you did to get it solved? Trying to preserve the quality of StackOverflow :) – thaJeztah Feb 01 '13 at 21:30

This also works:

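// '\\n' and '\\r' match the literal two-character sequences backslash-n and backslash-r
$chars = array("\r\n", '\\n', '\\r', "\n", "\r", "\t", "\0", "\x0B");
// str_replace() returns the new string, so the result has to be assigned
$data = str_replace($chars, "<br>", $data);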
ρяσѕρєя K

I used to have similar issues when importing many CSV files from different sources; a lot of the entries contained non-UTF-8 characters. Here is what I did that has worked with all the files so far, with explanatory comments alongside:

$row[$id] = str_replace(chr(130), ',', $row[$id]);    // baseline single quote
$row[$id] = str_replace(chr(131), 'NLG', $row[$id]);  // florin
$row[$id] = str_replace(chr(132), '"', $row[$id]);    // baseline double quote
$row[$id] = str_replace(chr(133), '...', $row[$id]);  // ellipsis
$row[$id] = str_replace(chr(134), '**', $row[$id]);   // dagger (a second footnote)
$row[$id] = str_replace(chr(135), '***', $row[$id]);  // double dagger (a third footnote)
$row[$id] = str_replace(chr(136), '^', $row[$id]);    // circumflex accent
$row[$id] = str_replace(chr(137), 'o/oo', $row[$id]); // permile
$row[$id] = str_replace(chr(138), 'Sh', $row[$id]);   // S Hacek
$row[$id] = str_replace(chr(139), '<', $row[$id]);    // left single guillemet
$row[$id] = str_replace(chr(140), 'OE', $row[$id]);   // OE ligature
$row[$id] = str_replace(chr(145), "'", $row[$id]);    // left single quote
$row[$id] = str_replace(chr(146), "'", $row[$id]);    // right single quote
$row[$id] = str_replace(chr(147), '"', $row[$id]);    // left double quote
$row[$id] = str_replace(chr(148), '"', $row[$id]);    // right double quote
$row[$id] = str_replace(chr(149), '-', $row[$id]);    // bullet
$row[$id] = str_replace(chr(150), '-', $row[$id]);    // endash
$row[$id] = str_replace(chr(151), '--', $row[$id]);   // emdash
$row[$id] = str_replace(chr(152), '~', $row[$id]);    // tilde accent
$row[$id] = str_replace(chr(153), '(TM)', $row[$id]); // trademark ligature
$row[$id] = str_replace(chr(154), 'sh', $row[$id]);   // s Hacek
$row[$id] = str_replace(chr(155), '>', $row[$id]);    // right single guillemet
$row[$id] = str_replace(chr(156), 'oe', $row[$id]);   // oe ligature
$row[$id] = str_replace(chr(159), 'Y', $row[$id]);    // Y Dieresis
// force-convert to ISO-8859-1 and back to UTF-8 to drop any remaining characters that ISO-8859-1 cannot represent
$row[$id] = iconv("UTF-8","ISO-8859-1//IGNORE",$row[$id]);
$row[$id] = iconv("ISO-8859-1","UTF-8",$row[$id]);
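
One caveat with the byte-wise replacements above: they assume single-byte input such as Windows-1252. In a string that is already UTF-8, bytes 128 to 159 occur as parts of multibyte characters (for example, ’ is encoded as the bytes E2 80 99, and `chr(153)` is 0x99), so those `str_replace()` calls could corrupt valid text. A possible alternative, sketched below, is to convert legacy input to UTF-8 first and leave valid UTF-8 alone. The name `to_utf8` and the guess that non-UTF-8 input is Windows-1252 are my assumptions, not part of the answer, and the sketch needs PHP 7+ with the iconv extension.

```php
<?php
// Hypothetical helper (not from the original answer): returns $raw as UTF-8.
function to_utf8(string $raw): string
{
    // An empty pattern with the /u flag only matches when the subject is valid UTF-8.
    if (preg_match('//u', $raw) === 1) {
        return $raw; // already UTF-8: leave multibyte characters untouched
    }

    // Assumption: anything that is not valid UTF-8 is Windows-1252.
    $converted = iconv('Windows-1252', 'UTF-8', $raw);

    // iconv() returns false on failure; fall back to the raw input in that case.
    return $converted === false ? $raw : $converted;
}

$legacy = "SPICER-SMITH" . chr(146) . "S"; // chr(146) is the Windows-1252 right single quote
var_dump(to_utf8($legacy) === "SPICER-SMITH\u{2019}S");
```

After this step the string holds real Unicode characters, so any replacement of curly quotes, dashes and so on can be done on whole characters rather than single bytes.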
IK ZU QUAN