
I want to remove emojis from XML files. A typical example string looks like this:

Input: <UserName>JANE - MARIE &#55357;&#56628➡️</UserName>.

I want to keep only:

Output: <UserName>JANE - MARIE</UserName>.

I tried to use sed, but I'm not good with regular expressions. Can anyone help me, or suggest another approach?

Thanks.

  • Welcome to SO. Please add your attempts, in the form of code, to your question (this is highly encouraged on SO). Thank you (not my downvote, by the way). – RavinderSingh13 Apr 08 '21 at 15:01
  • @Amine Al Arbi - You say you _want to remove emojis_ - why did you also remove a space then in your example? – Armali Apr 08 '21 at 15:56
  • Do you really have and want to handle malformed HTML character references without `;`? – Armali Apr 08 '21 at 16:03

1 Answer


It looks like you want to remove non-ASCII characters. Whether that is acceptable depends on whether the names may legitimately contain non-ASCII characters (e.g. á, é, í, ó, ú, ü, ñ), because those would be removed as well. If this simple approach (removing all non-ASCII characters) is sufficient for your requirements:

LANG=C sed -i 's/[\d128-\d255]//g' <FILENAME>

Tested on my side:

$ LANG=C sed -i 's/[\d128-\d255]//g' /tmp/x.txt
$ cat /tmp/x.txt
<UserName>JANE - MARIE &#55357;&#56628</UserName>.
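Note that the numeric character references (`&#55357;&#56628`) survive in the output above, because they consist of plain ASCII characters. If they should go too, something along these lines might work. This is only a sketch, not a tested production solution: `strip_emoji` is a made-up helper name, and it assumes the emoji references are always decimal surrogate values (55296 to 57343), with or without the trailing `;`. Like the command above, it also deletes accented letters.

```shell
#!/bin/sh
# Sketch only: strip_emoji is a hypothetical helper name, not an existing tool.
# It reads text on stdin and writes the cleaned text to stdout.
#
# Pipeline stages:
#   1. sed: drop decimal character references in the surrogate range
#      55296-57343 (e.g. "&#55357;"). The trailing ";" is optional so that
#      malformed references such as "&#56628" are covered as well.
#      Hexadecimal references (e.g. "&#xD83D;") are NOT handled.
#   2. tr:  delete every byte >= 0x80, i.e. all non-ASCII characters
#      (this also removes accented letters such as "é").
#   3. sed: remove whitespace left dangling directly before a closing tag.
strip_emoji() {
  LC_ALL=C sed -E 's/&#(5529[6-9]|55[3-9][0-9]{2}|56[0-9]{3}|57[0-2][0-9]{2}|573[0-3][0-9]|5734[0-3]);?//g' |
    LC_ALL=C tr -d '\200-\377' |
    LC_ALL=C sed -E 's|[[:space:]]+</|</|g'
}

# Usage (file names are placeholders):
#   strip_emoji < input.xml > output.xml
```

With the example line from the question as input, this should print `<UserName>JANE - MARIE</UserName>.`, including the removal of the leftover space before the closing tag.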
azbarcea