I want to remove emojis from XML files. A typical example string could be something like:
<a> grêve �� SNCF ➡️</a>
I want to have only:
<a>grêve SNCF</a>
I tried Nokogiri's noent option and some filters after the parse stage, but to_xml returns the emojis as numeric character references (HTML-style entities), so my filter no longer detects them.
It returns something like:
<a>&#x1F92C; grêve SNCF &#x1F534; &#x27A1;&#xFE0F;</a>
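One way to make such references visible to a code point filter again might be to decode them back into characters before filtering. The following is only a minimal sketch: decode_char_refs is a hypothetical helper (not part of Nokogiri), and it assumes the serialized output uses hexadecimal or decimal numeric character references. References to ASCII characters such as &#x26; are deliberately left untouched so the markup stays well-formed.

```ruby
# Hypothetical helper: turn numeric character references such as
# "&#x1F92C;" or "&#129324;" back into the characters they stand for.
# References to ASCII characters (e.g. "&#x26;" for "&") and values
# outside the Unicode range are returned unchanged.
def decode_char_refs(str)
  str.gsub(/&#(?:x([0-9A-Fa-f]+)|([0-9]+));/) do |ref|
    code = $1 ? $1.hex : $2.to_i
    code > 0x7F && code <= 0x10FFFF ? [code].pack('U') : ref
  end
end

serialized = '<a>&#x1F92C; gr&#xEA;ve SNCF &#x1F534;</a>'
puts decode_char_refs(serialized)
```

With Nokogiri itself, another option that may avoid the references altogether is to request UTF-8 output explicitly, for example xml.to_xml(encoding: 'UTF-8'), since the serializer tends to escape non-ASCII characters when no output encoding is known.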
require 'nokogiri'
xml = Nokogiri::XML(%{
<root>
<aliens>
<alien>
<name>
grêve �� SNCF ➡️
</name>
</alien>
</aliens>
</root>
}) do |config|
  config.noent
end
puts xml
# emoticons
clean_xml_str = xml.to_xml
  .unpack('U*')
  .reject { |e|
    # emoticons block
    e.between?(0x1F600, 0x1F6FF) ||
    # basic block - control characters
    e.between?(0x0000, 0x001F) ||
    # Private Use Area
    e.between?(0xE000, 0xF8FF)
  }
  .pack('U*')
puts clean_xml_str
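Note that the ranges in the filter above would not catch every character in the example even when the emojis are present as real characters: U+1F92C, U+1F534, U+27A1 and the variation selector U+FE0F all fall outside 0x1F600..0x1F6FF. Below is a minimal sketch of a wider filter. EMOJI_RANGES and strip_emojis are hypothetical names, and the chosen ranges are an assumption rather than an exhaustive definition of "emoji".

```ruby
# Code point ranges treated as "emoji" in this sketch (an assumption,
# not a complete list).
EMOJI_RANGES = [
  0x1F300..0x1FAFF, # wide span: pictographs, emoticons, transport and neighbouring blocks
  0x2600..0x27BF,   # miscellaneous symbols and dingbats (includes U+27A1)
  0xFE00..0xFE0F,   # variation selectors (U+FE0F follows many emojis)
  0x200D..0x200D    # zero-width joiner used inside emoji sequences
].freeze

# Hypothetical helper: drop the code points listed above, then collapse
# the runs of spaces they leave behind and trim both ends.
def strip_emojis(text)
  kept = text.unpack('U*').reject { |cp| EMOJI_RANGES.any? { |r| r.cover?(cp) } }
  kept.pack('U*').squeeze(' ').strip
end

puts strip_emojis("\u{1F92C} grêve SNCF \u{1F534} \u{27A1}\u{FE0F}")
```

Applying the helper to text nodes rather than to the whole serialized document keeps tags and newlines untouched; with Nokogiri that could look like xml.xpath('//text()').each { |t| t.content = strip_emojis(t.content) }, followed by a single call to to_xml.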
See sandbox on repl.it for more information.