I am at a loss as to how to remove special characters from a string so that only characters supported by UTF-8 plus French characters remain. The Base64 string below contains special characters that my sanitizing function has failed to remove, and this prevents the text from printing when using FPDF cells etc. If you decode the string at https://www.base64decode.org/ you will see the special characters.

// My sanitizing function
static function remove_none_word_chars($string) {
        return preg_replace('/[^a-zA-Z0-9`_.,;@#%~\’\'\"+*\?\^\[\]\$\(\)\{\}\=!\<\>\|\-:\s\/\\sàâçéèêëîïôûùüÿñæœ]/ui', '', $string);
    }

74KnIFN1cGVydmlzZXIgbGUgdHJhdmFpbCBkZSBs4oCZZW5zZW1ibGUgZHUgcGVyc29ubmVsIGRlIHByb2R1Y3Rpb24sIGRlIGzigJllbnRyZXRpZW4gZXQgZGUgbGEgbWFpbnRlbmFuY2Ugc3VyIGxlIHF1YXJ0IGRlIG51aXQgZW4gdGVuYW50IGNvbXB0ZSBkZSBsYSBjb252ZW50aW9uIGNvbGxlY3RpdmU7Cu+CpyBBc3N1cmVyIHVuZSBib25uZSBnZXN0aW9uIGRlIGzigJllbnNlbWJsZSBkZXMgb3DDqXJhdGlvbnMgZGUgbOKAmXVzaW5lOwrvgqcgUGxhbmlmaWVyIGRlcyBvcMOpcmF0aW9ucyBlbiBmb25jdGlvbiBkZXMgYm9ucyBkZSBjb21tYW5kZTsK74KnIEFwcG9ydGVyIGxlcyBtb2RpZmljYXRpb25zIGV4aWfDqWVzIGxvcnMgZGVzIGRpZmbDqXJlbnRzIGF1ZGl0cyAoR2VuZXJhbCBEeW5hbWljcywgSVNPOTAwMSwgT0hTQVMxODAwMSwgZXRjLik7Cu+CpyBSZW5kcmUgY29tcHRlIGR1IHN1aXZpIGRlcyBvcMOpcmF0aW9ucyDDoCBjaGFxdWUgZGlyZWN0ZXVyIGRlIGTDqXBhcnRlbWVudCBsb3JzIGR1IGNoYW5nZW1lbnQgZGUgcXVhcnQ7Cu+CpyBWb2lyIGF1IHN1aXZpIGRlcyBidWRnZXRzIGV0IGVuIGFzc3VyZXIgbGUgcmVzcGVjdC4=

Update: Thanks all for the answers. The above function does indeed work; there was a conditional statement elsewhere that I forgot to change, and that's why it failed :( Embarrassing.

ericsicons

4 Answers

Your function works; you are just not decoding the string before passing it through the function.

Use it like this: remove_none_word_chars(base64_decode($string))
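
A minimal sketch of that call order, assuming the question's function is declared as a plain function rather than a static method, and using a short made-up sample instead of the full string. Base64 output consists only of letters, digits, +, / and =, all of which the character class allows, so the encoded text passes through unchanged; the unwanted character only becomes visible to the regex after decoding. The sample starts with U+F0A7, the private-use bullet that the question's string also starts with.

```php
<?php
// The question's sanitizer, copied as-is but declared as a plain function
// (an assumption made so that the sketch runs on its own).
function remove_none_word_chars($string) {
    return preg_replace('/[^a-zA-Z0-9`_.,;@#%~\’\'\"+*\?\^\[\]\$\(\)\{\}\=!\<\>\|\-:\s\/\\sàâçéèêëîïôûùüÿñæœ]/ui', '', $string);
}

// Hypothetical sample: U+F0A7 (private-use bullet) followed by French text.
$raw     = "\u{F0A7} Assurer l\u{2019}usine;";
$encoded = base64_encode($raw);

// Sanitizing the still-encoded text leaves it untouched.
$unchanged = remove_none_word_chars($encoded);

// Decoding first lets the regex see, and strip, the private-use character.
$clean = remove_none_word_chars(base64_decode($encoded));

var_dump($unchanged === $encoded);
echo $clean, PHP_EOL;
```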

cmorrissey

Here is a way to replace every run of characters that are neither letters nor digits with a single space:

static function remove_none_word_chars($string) {
    return preg_replace('~[^\\pL\d]+~u', ' ', $string);
}

See it in action: http://3v4l.org/GP31i
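
To make the effect concrete, here is a small sketch that applies the same pattern to a made-up sample (the function name and the input are illustrative only). Note that the apostrophe and all punctuation are replaced as well, which may be more aggressive than French text needs.

```php
<?php
// Hypothetical wrapper around the pattern from the answer above.
function keep_letters_and_digits($string) {
    // \pL matches any Unicode letter; the u modifier makes PCRE treat the
    // pattern and the subject as UTF-8. Each run of other characters
    // collapses into a single space.
    return preg_replace('~[^\\pL\d]+~u', ' ', $string);
}

// \u{E9} is the precomposed letter é, \u{2019} the curly apostrophe.
$sample = "\u{F0A7} l\u{2019}usine; op\u{E9}rations (ISO9001)";

// trim() drops the spaces left where the leading bullet and the closing
// parenthesis used to be.
$result = trim(keep_letters_and_digits($sample));
echo $result, PHP_EOL;
```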

BentoumiTech

I believe you can use these functions:

$test = utf8_encode("your text here");
$new = utf8_decode($test);
Pang

To remove non-printing characters, you can use a regular expression.

$data= preg_replace('/[^\x0A\x20-\x7E\xC0-\xD6\xD8-\xF6\xF8-\xFF]/','',$data);

// Or to preserve extended characters, use the below expression.
// Mind you many of these may still be non-printing.
$data= preg_replace('/(?!\n)[[:cntrl:]]+/','',$data);

This is from an answer to a previous question of mine to remove non-printing characters from a string destined for the error_log.

What this does is remove all characters that are not in the provided list or (in the second example) that are control characters. The lists:

\x0A = [newline]
\x20-\x7E = [space] ! " # $ % & ' ( ) * + , - . / 0 1 2 3 4 5 6 7 8 9 : ; < = > ? @ A B C D E F G H I J K L M N O P Q R S T U V W X Y Z [ \ ] ^ _ ` a b c d e f g h i j k l m n o p q r s t u v w x y z { | } ~
\xC0-\xD6 = À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö
\xD8-\xF6 = Ø Ù Ú Û Ü Ý Þ ß à á â ã ä å æ ç è é ê ë ì í î ï ð ñ ò ó ô õ ö
\xF8-\xFF = ø ù ú û ü ý þ ÿ

As for encoding to UTF-8, this shouldn't really be much of a problem, and there are functions available, such as utf8_encode, that may help. I believe you would have to call it on the string before removing non-printing characters. Please note, however, that if the string is not in the expected format, or is already in UTF-8, this may make the string unreadable.
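
The first pattern behaves differently depending on whether the u modifier is present, which can be sketched on a made-up UTF-8 sample as follows. Without u, PCRE matches byte by byte, so the ranges describe single-byte values and multi-byte UTF-8 sequences get cut apart; with u, the same ranges refer to the code points U+00C0 to U+00FF. The variable names and the input are assumptions for illustration.

```php
<?php
$pattern = '[^\x0A\x20-\x7E\xC0-\xD6\xD8-\xF6\xF8-\xFF]';

// Hypothetical UTF-8 input: private-use bullet, accented letter, curly apostrophe.
$data = "\u{F0A7} op\u{E9}rations de l\u{2019}usine\n";

// Byte mode: only the lead bytes 0xEF, 0xC3 and 0xE2 fall inside the kept
// ranges, so their continuation bytes are stripped and the result is no
// longer valid UTF-8.
$byteMode = preg_replace('/' . $pattern . '/', '', $data);

// UTF-8 mode: é (U+00E9) is kept whole; U+F0A7 and U+2019 are removed.
$utf8Mode = preg_replace('/' . $pattern . '/u', '', $data);

var_dump(bin2hex($byteMode));
var_dump($utf8Mode);
```

Which variant fits depends on whether the input is a single-byte string or UTF-8 at the point where the replacement runs.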

Mike
    This works, but I had to add the /u modifier: $s = preg_replace('/[^\x0A\x20-\x7E\xC0-\xD6\xD8-\xF6\xF8-\xFF]/u', '', $string) – ericsicons Mar 17 '15 at 00:37