I am working on a system that uses XML for saving messages. Sometimes customers send unusual characters, such as the password dot (•), which results in an XML parse error in simplexml_load_string. At the moment I am using a replace, which works well in most cases:

 return str_replace( ["&", "", "<", ">", "`", "~", '"' ], ' ', $text );

So for the password dot I added:

$message = str_replace("•", "", $message);

This works, but I am wondering whether there is an easier way to fix this than replacing every character individually. Alternatively, I could allow only certain characters, such as a-z, 0-9, spaces and .,/?! etc. That worked with preg_replace, but then the emoticons from phones did not come through, and neither did Swedish and Spanish characters like Å, Ä, Ñ and Ö. Thanks!

Edit: I updated the dot replacement. It works, but it doesn't fix the XML parse problem when other unusual characters are used.

user3306814
  • Did you mean `$message = str_replace("•", "", $message);` – The fourth bird Jul 25 '22 at 20:51
  • Are you building your XML document as a string or using XML-aware functions? https://3v4l.org/8dPXp – Chris Haas Jul 25 '22 at 21:00
  • @Thefourthbird yes, thanks, this works (I got this solution from somewhere on this forum), but if some other character appears I have to add another replace. I would like to have a general solution – user3306814 Jul 25 '22 at 21:02
  • @ChrisHaas it's built up with strings and I don't have access to change this – user3306814 Jul 25 '22 at 21:03
  • Can you just use [`htmlspecialchars`](https://3v4l.org/pTnQ9) to safely pass all strings? – Chris Haas Jul 25 '22 at 21:12
  • @ChrisHaas no, • won't change; it stays a dot, and the same is true for more characters. I would prefer to just ignore all those unnecessary characters. `echo htmlspecialchars('•');` outputs • in the source code – user3306814 Jul 25 '22 at 22:01
  • Make sure that the XML is declared as UTF-8 (XML declaration, HTTP header). `•` is a Unicode character; it does not need escaping in a UTF-8 XML string. – ThW Jul 25 '22 at 22:38
  • The solution is to write XML with XML-aware functions. If you don’t have access, I would change that, ask whoever you need to, or replace that system. Otherwise, use [UTF-8 all the way through](https://stackoverflow.com/a/279279/231316) as @ThW alluded to. Otherwise, manually pick the characters you consider safe and read through [character classes and ranges](https://www.php.net/manual/en/reference.pcre.pattern.syntax.php) for RegEx. – Chris Haas Jul 25 '22 at 23:45
  • @ThW it is a PHP file calling this string, and in the config I use `$conn->set_charset( 'utf8mb4' );`. But now, when I test by inserting the • (password dot) directly into the message in the database, I don't get errors, which seems weird to me – user3306814 Jul 25 '22 at 23:59
  • @ChrisHaas if I really want to, I could get this changed if I tell them it would help. If I can, wouldn't it be better to use JSON or something like that? – user3306814 Jul 26 '22 at 00:01
  • Better is subjective. There’s nothing wrong with XML, it is just more verbose than newer formats. But if you are _really_ working with XML in the pipeline, and not just strings that _look_ like XML, this should be a simple thing unless you’ve been tasked with fixing a function called `sanitizeStringForXML()`. But, if your hands are tied, we’ve all been there, too. In that case, like I said, character ranges and classes with RegEx. Just be prepared to write a lot of tests. – Chris Haas Jul 26 '22 at 00:08
  • For me the regex would be the way to go for now. I've tried it before, but then the emoticons stopped working, and filtering with regex is quite difficult (for me at least). I would just allow any alphabetic characters (including foreign characters), numbers, emoticons, spaces and the punctuation ?.,! Another weird thing: I just tested the chat and inserted the password dot without any errors, but when I have to fix a message, removing the dot in the database fixes the problem. It's really weird... – user3306814 Jul 26 '22 at 00:23
  • “foreign characters” - foreign to whom? You are also saying “emoticons”, which are totally a thing, but do you mean emoji or things like `:)`? For RegEx you need to come up with your list of safe characters, then negate that. Start small, work up. If you have character sequences like `:)`, consider replacing them with something safe such as `:smile:`, do your RegEx, and then undo. – Chris Haas Jul 26 '22 at 03:42
  • It sounds like you have a point in your application logic where you do not use UTF-8, or where a UTF-8 string is treated as another (non-Unicode) encoding. If you use UTF-8 everywhere, there is not much difference between `•`, `ä` or ``. Removing only some of the non-ASCII characters will not help you - even the `ä` might break your logic. *btw* PHP has a special class for converting Unicode characters: [Transliterator](https://www.php.net/manual/en/class.transliterator.php) – ThW Jul 26 '22 at 09:06
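
The suggestions in the comments (UTF-8 throughout, `htmlspecialchars`, and a negated allow-list for RegEx) could be combined roughly as follows. This is only a sketch under assumptions: the XML is string-built and declared as UTF-8, the mbstring extension is available, and the file itself is saved as UTF-8. The function names `xmlSafeText` and `keepAllowedCharacters` are hypothetical and not part of the asker's code.

```php
<?php
// Hypothetical helpers; the names are made up for this sketch.

/**
 * Prepare user text for embedding in a string-built XML 1.0 document.
 * Assumes the text is supposed to be UTF-8 (requires ext-mbstring).
 */
function xmlSafeText(string $text): string
{
    // Replace malformed UTF-8 byte sequences so the parser never sees them.
    $text = mb_scrub($text, 'UTF-8');

    // Remove code points that XML 1.0 does not allow (most control characters).
    // Allowed: #x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD, #x10000-#x10FFFF.
    $text = preg_replace(
        '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u',
        '',
        $text
    ) ?? '';

    // Escape markup characters; characters such as the bullet, Å, Ñ and emoji
    // are legal in UTF-8 XML and pass through unchanged.
    return htmlspecialchars($text, ENT_QUOTES | ENT_XML1, 'UTF-8');
}

/**
 * Optional stricter filter: keep only an allow-list and drop everything else.
 * \p{L} letters of any script, \p{M} combining marks, \p{N} digits,
 * \p{Zs} spaces, \p{So} "other symbols" (covers many, not all, emoji).
 */
function keepAllowedCharacters(string $text): string
{
    return preg_replace(
        '/[^\p{L}\p{M}\p{N}\p{Zs}\p{So}\x{200D}.,!?:;()\'\/-]/u',
        '',
        $text
    ) ?? '';
}

// Usage: the backspace (\x08) is removed, the markup characters are escaped,
// and the bullet, accented letters and emoji survive parsing.
$message = "Tom & Jerry <3 • Ñandú \x08 😀";
$xml = '<?xml version="1.0" encoding="UTF-8"?><message>'
     . xmlSafeText($message)
     . '</message>';
$doc = simplexml_load_string($xml);
echo (string) $doc, PHP_EOL;
```

If the parse error persists with input like this, that would point to an encoding mismatch somewhere between the database, PHP and the XML declaration rather than to the characters themselves, as the comments above suggest.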

0 Answers