
I have a $text = "Hello üäö$"

I want to remove just the emojis from the text using XQuery. How can I do that?

Expected result : "Hello üäö$"

I tried to use:

replace($text, '\p{IsEmoticons}+', '')

but it didn't work; it only removed the smileys.

Result now: "Hello üäö$" Expected result : "Hello üäö$"

Thanks in advance :)

  • I used replace($test,"\p{So}+", '') as well, but it still did not remove all the symbols. Result: "Hello üäö$" [link](https://www.w3.org/TR/xmlschema-2/#cces) – Prem Sagar J Nov 23 '21 at 09:20
  • Interesting, it seems the characters ` `, ``, and `` are not part of the `\p{IsEmoticons}` class, at least not in the version of Unicode that Saxon 10.6 uses at https://xqueryfiddle.liberty-development.net/94hwpi9. – Martin Honnen Nov 23 '21 at 11:58
  • You will need to enumerate the various categories those characters belong to, e.g. `'[\p{IsEmoticons}\p{So}]'` as the second argument of `replace` will remove ` `. I will need to check, or you can check yourself, which category or categories the other characters belong to. – Martin Honnen Nov 23 '21 at 12:09
  • Duplicate of https://stackoverflow.com/questions/70070385/how-can-i-remove-emojis-from-text-using-xquery/70070562#70070562 – line-o Nov 25 '21 at 14:39

1 Answer


I outlined the approach in my answer to the original question, which I updated based on your comment asking about how to strip out .

Quoting from that expanded answer:

The "Emoticons" block doesn't contain all characters commonly associated with "emoji." For example, 💜 (Purple Heart, U+1F49C), according to a Unicode lookup site such as https://www.compart.com/en/unicode/U+1F49C, is from:

Miscellaneous Symbols and Pictographs, U+1F300 - U+1F5FF

This block is not available in XPath or XQuery processors: it is listed neither in the XML Schema 1.0 spec linked above nor in "Unicode block names for use in XSD regular expressions", the list of blocks that XPath and XQuery processors conforming to XML Schema 1.1 are required to support.

For characters from blocks not available in XPath or XQuery, you can manually construct character classes. For example, given the purple heart character above, we can match it as follows:

replace("Purple 💜 heart", "[🌀-🗿]", "")

This returns the expected result:

Purple  heart
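Because emoji typed literally into a query are easy to lose or garble when the source is copied around, the same range can also be written with XML character references. The following is a minimal sketch, assuming the block boundaries quoted above (U+1F300 to U+1F5FF); the variable name `$pictographs` is only illustrative:

```xquery
xquery version "3.1";

(: Miscellaneous Symbols and Pictographs, U+1F300 - U+1F5FF,
   written as character references instead of literal emoji :)
let $pictographs := "[&#x1F300;-&#x1F5FF;]"
return
    (: &#x1F49C; is the Purple Heart character discussed above :)
    replace("Purple &#x1F49C; heart", $pictographs, "")
```

This should behave the same as the literal version, since the XQuery parser resolves the character references before the regular expression is evaluated.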

This approach can be applied to , , or any other character:

  1. Locate the character's Unicode block.
  2. Craft your regular expression with the block name (if available in XPath) or character class.

Alternatively, rather than locating the blocks of characters you want to strip out, you could identify the blocks of characters you want to preserve. For example, given the example string in the original post, perhaps the goal is to preserve only those characters in the "Basic Latin" block. To do so, we can match characters NOT in this block via the \P Category Escape:

xquery version "3.1";

let $text := "Hello    üäö$"
return
    replace($text, "\P{IsBasicLatin}", "")

This query returns:

Hello    $

Notice that this has stripped out the characters with diacritics, which perhaps isn't desired. These characters belong to the Latin-1 Supplement block. To preserve characters from both the Basic Latin and Latin-1 Supplement blocks, we'd need to adjust the query as follows:

xquery version "3.1";

let $text := "Hello    üäö$"
return
    replace($text, "[^\p{IsBasicLatin}\p{IsLatin-1Supplement}]", "")

... which returns:

Hello    üäö$

This now preserves the characters with diacritics.

To be precise about the characters you preserve or remove, you need to consult the Unicode blocks and charts.
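If enumerating blocks feels too fine-grained, a variation on the same "preserve" idea is to whitelist general categories instead of blocks. The sketch below is an assumption on my part rather than something tested against your data: it keeps letters (`\p{L}`), numbers (`\p{N}`), punctuation (`\p{P}`), separators such as spaces (`\p{Z}`), and currency symbols (`\p{Sc}`), and removes everything else.

```xquery
xquery version "3.1";

(: Sample input with two emoji written as character references :)
let $text := "Hello &#x1F600; &#x1F49C; üäö$"
(: Assumed whitelist: letters, numbers, punctuation, separators, currency symbols :)
let $keep := "[^\p{L}\p{N}\p{P}\p{Z}\p{Sc}]"
return
    replace($text, $keep, "")
```

This leaves the letters (including those with diacritics), the spaces, and the dollar sign in place. Note the trade-off: the whitelist also drops combining marks, math symbols, and other symbols such as ©, so adjust the list of categories to match what you actually need to keep.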

Joe Wicentowski
  • Thank you for the answer, but if I have many such characters, is there any general way to eliminate all the characters at once? – Prem Sagar J Nov 25 '21 at 07:57
  • I have expanded my answer with a more general approach. If you find my answer to this question and/or your previous question is correct, please mark it as the answer so other users know it worked for you. – Joe Wicentowski Nov 26 '21 at 16:51