
I am new to regex, and I followed other Stack Overflow answers to build a sed command that removes invalid XML characters.

sed -ie 's/[^\u0009\r\n\u0020-\uD7FF\uE000-\uFFFD\ud800\udc00-\udbff\udfff]//g' myfile.xml

When I run this, it deletes a lot of letters. For example, in the word company it deletes o, m, p, a, y, etc. Lowercase letters are affected the most.

Either something is wrong with my regex, or sed is not treating it as a regex. Would you please help me? Thank you.

rlee
  • This is the Stack Overflow question I was following: https://stackoverflow.com/questions/4237625/removing-invalid-xml-characters-from-a-string-in-java I ran the command from a terminal to test it, but it is not working. – rlee Apr 27 '20 at 17:33
  • @KamilCuk I followed only the "regex" part for removing invalid XML characters, not the rest of the Java code. That regex part is pretty universal and I believe it is compatible with sed. – rlee Apr 27 '20 at 17:40
  • Is your file Unicode or ASCII? – JNevill Apr 27 '20 at 17:50
  • No, it doesn't work; as I suspected, sed is not Java. The `\u0020-\uD7FF` matches literally `u` `0` `2` `D` `7` `F`. The `\u` has no special meaning in a sed regex (GNU sed only gives it a meaning in the replacement text, where it converts the next character to upper case). `\u` is just `u`. In `company`, your regex removes every character except `c`; it is roughly equivalent to `[^-02789DEFbcdf\n\ru0-\]`, so all characters outside that set are removed. I think the easiest fix would be to use a utility that understands encodings, probably perl or python. – KamilCuk Apr 27 '20 at 17:52
  • @KamilCuk I see. Can you give me an example of how to work with sed and Unicode? Does \x work? – rlee Apr 27 '20 at 19:18
  • You can't (well, you can, but it's really hard and not practical). UTF-8 encodes code points with different byte lengths. For example, `\ud800\udc00-\udbff\udfff` is meant to represent a range of code points that take 4 bytes each in UTF-8; in sed, you would basically have to list all of the possible byte combinations. The Java code also works with any encoding (UTF-8, UTF-16, UTF-32), because the file is decoded when it is read. Here you would have to handle each encoding separately, and they differ at the byte level, i.e. the level sed works at. TL;DR: use another tool, like perl or python. – KamilCuk Apr 27 '20 at 19:49
  • For sed and unicode characters you may have a look into [sed that supports unicode?](https://unix.stackexchange.com/questions/196780/). – U880D Apr 28 '20 at 07:33
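
Following the suggestion in the comments to use an encoding-aware tool, here is a minimal Python sketch of the same idea. It assumes the file is UTF-8 and takes the allowed ranges from the XML 1.0 `Char` production (the surrogate-pair range in the Java regex corresponds to U+10000 to U+10FFFF once the text is decoded). The names `strip_invalid_xml_chars` and `clean_file` are made up for illustration and do not come from the thread.

```python
import re

# Negated character class built from the XML 1.0 "Char" production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
# Python's re module understands \uXXXX and \UXXXXXXXX escapes, unlike sed.
_INVALID_XML_CHARS = re.compile(
    r"[^\u0009\n\r\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_invalid_xml_chars(text: str) -> str:
    """Return text with every character outside the XML 1.0 ranges removed."""
    return _INVALID_XML_CHARS.sub("", text)


def clean_file(src_path: str, dst_path: str, encoding: str = "utf-8") -> None:
    """Copy src_path to dst_path, dropping invalid XML characters.

    Assumption: the input decodes cleanly with the given encoding.
    newline="" keeps the original line endings untouched.
    """
    with open(src_path, encoding=encoding, newline="") as src, \
         open(dst_path, "w", encoding=encoding, newline="") as dst:
        for line in src:
            dst.write(strip_invalid_xml_chars(line))
```

For example, `strip_invalid_xml_chars("comp\x00any")` returns `"company"`, and `clean_file("myfile.xml", "myfile.clean.xml")` would write a cleaned copy next to the original rather than editing it in place (the output filename is only an example).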

0 Answers