Remove special HTML characters from string in PHP

Question

I found many results but for some reason nothing works for me! I've tried preg_replace with regex and also html_entity_decode, but no good...

I want to select words that have a hash mark prefix, e.g. #WORD, which works just fine, but sometimes the hash mark is read as &rlm;#WORD and it messes up.

Example: This is a normal #hash_mark but ‏#this_isn't

The regex I use to select words with a hash mark prefix: '~(?<=\s|^)#[^\s#]++~um'

In the question marked as a duplicate, the answer doesn't work for Unicode text.

The code below removes all special characters, including Unicode text; all that's required is to replace the &rlm;# with a normal #:

function remove_special_char($sentence) {
    return preg_replace('/[^a-zA-Z0-9_ %\[\]\.\(\)%&-]/s', '', $sentence);
}

echo remove_special_char("hello مرحبا привет שלום");

Output:

hello

Well... your regex matches hashes *preceded by a space*. That hash is not preceded by a space. Should it be? Should your regex match something else? — deceze, Jul 31 '13 at 13:24
You could add the right-to-left marker to your positive lookback assertion, such as: `'~(?<=\s|^|‏)#[^\s#]++~um'`. — Phylogenesis, Jul 31 '13 at 13:24
@Phylogenesis the solution is the same, but the problem he ran into is hard to spot with the naked eye, because the two characters render identically. — Telvin Nguyen, Jul 31 '13 at 14:18
If it "doesn't" work could you explain more? Include what you've done and the _actual_ text that it doesn't work on; not pictures. The site supports unicode so it's fine. — Ben, Jul 31 '13 at 22:36
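
The lookbehind suggestion in the comments above can be sketched as follows. This is a minimal example, not tested against the asker's real data: it writes the right-to-left mark as the escape `\x{200F}` so the invisible character is visible in the source, and it assumes the input holds the actual character rather than the literal text `&rlm;`. The sample string is an assumption modeled on the example in the question.

```php
<?php
// Allow U+200F (right-to-left mark) immediately before the hash, in addition
// to whitespace or start of line. The /u modifier enables UTF-8 handling.
$pattern = '~(?<=\s|^|\x{200F})#[^\s#]++~um';

// "\xE2\x80\x8F" is the UTF-8 byte sequence for U+200F.
$text = "This is a normal #hash_mark but \xE2\x80\x8F#this_isn't";

preg_match_all($pattern, $text, $matches);
print_r($matches[0]); // both hash words are captured
```

The mark itself stays outside the match because it is only consumed by the lookbehind, so each captured word starts with a plain `#`.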

score 1 · Answer 1 · edited May 23 '17 at 10:25

There were two different characters.

To show exactly what happened, I wrote some debug code:

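        // "\xE2\x80\x8F" is the UTF-8 encoding of the right-to-left mark (U+200F),
        // written as an escape so the invisible character shows up in the source.
        var_dump(ord("\xE2\x80\x8F#")); // ord() returns the value of the first byte only
        $str1 = "This is character 226 \xE2\x80\x8F#";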

        $str1v1 = preg_replace('/[^a-zA-Z0-9_ %\[\]\.\(\)%&-]/s', '', $str1);

        var_dump(ord('#')); // a plain hash is the single byte 35
        $str2 = "This is character 35 #";

        $str2v1 = preg_replace('/[^a-zA-Z0-9_ %\[\]\.\(\)%&-]/s', '', $str2);


        var_dump($str1v1);
        var_dump($str2v1);

        var_dump($str1);
        var_dump($str2);

Output:

int 226
int 35
string 'This is character 226 ' (length=22)
string 'This is character 35 ' (length=21)
string 'This is character 226 â€#' (length=26)
string 'This is character 35 #' (length=22)
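
The `int 226` and the extra bytes in the length above come from the UTF-8 encoding of the right-to-left mark. A small sketch that makes those bytes visible (the variable name `$rlmHash` is only for illustration):

```php
<?php
// U+200F (right-to-left mark) encoded as UTF-8, followed by a plain hash.
$rlmHash = "\xE2\x80\x8F#";

echo bin2hex($rlmHash), "\n"; // e2808f23: three bytes for the mark, one for '#'
echo strlen($rlmHash), "\n";  // 4, because strlen() counts bytes
echo ord($rlmHash), "\n";     // 226, the first byte (0xE2)
```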

Maybe you or your end user copied and pasted the text from somewhere, and the paste included the hidden character you described (&rlm;#). The two versions render identically, which is what caused the confusion.

To strip those characters, I used the regex in the following line:

preg_replace('/[^a-zA-Z0-9_ %\[\]\.\(\)%&-]/s', '', $str1);

The regex is taken from the question "PHP remove special character from string".
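
If only the right-to-left mark should go and all other text, including non-Latin scripts, should stay, a narrower replacement may be enough. The sketch below is one possible approach, not the method used above: `strip_rlm` is a hypothetical helper name, the code needs PHP 7 or later, and it assumes the input is valid UTF-8 containing the actual U+200F character rather than the literal text `&rlm;`.

```php
<?php
// Hypothetical helper: removes only U+200F and leaves everything else alone.
function strip_rlm(string $sentence): string
{
    // The /u modifier makes PCRE treat pattern and subject as UTF-8.
    // preg_replace() returns null on invalid UTF-8, so fall back to the input.
    return preg_replace('/\x{200F}/u', '', $sentence) ?? $sentence;
}

// "\xE2\x80\x8F" is the UTF-8 byte sequence for U+200F.
$clean = strip_rlm("hello привет \xE2\x80\x8F#tag");

// The hash-word regex from the question now sees a space before the hash.
preg_match_all('~(?<=\s|^)#[^\s#]++~um', $clean, $matches);
print_r($matches[0]);
```

Stripping the mark changes how mixed-direction text is displayed, so this fits best on a copy used for matching rather than on the text shown to users.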

There may be some slight confusion here. The UTF-8 encoding of the right-to-left marker (U+200F) is the 3 bytes `0xE2 0x80 0x8F`. Because of PHP's lack of internal unicode handling, the standard string functions act upon bytes, rather than characters. The ord() function returns the value of the first byte in the string supplied as its parameter - which in this case is 226. — Phylogenesis, Jul 31 '13 at 14:57
Good to know. I didn't know the exact reason you gave here. Thanks — Telvin Nguyen, Jul 31 '13 at 15:12
Very useful information, but it seems that the regex at the end only works on non-Unicode text; in the other case it messes up all the characters. — Khaled, Jul 31 '13 at 16:54
