I want to remove emojis from XML files. A typical example string could be something like:

<a> grêve &#55357;&#56628; SNCF  ➡️</a>

I want to have only:

<a>grêve SNCF</a>

I tried to use Nokogiri's noent option and some filters after the parse stage, but to_xml returns the emojis as numeric character references (HTML-style entities), so my filters no longer detect them. It returns something like:

<a>&#x1F92C; gr&#xEA;ve  SNCF &#x1F534; &#x27A1;&#xFE0F;</a>

require 'nokogiri'

xml = Nokogiri::XML(%{
  <root>
    <aliens>
      <alien>
        <name>
           grêve &#55357;&#56628; SNCF  ➡️
        </name>
      </alien>
    </aliens>
  </root>
}) do |config|
  config.noent
end

puts xml

# emoticons
clean_xml_str = xml.to_xml
  .unpack('U*')
  .reject{ |e|
    # emoticons block
    e.between?(0x1F600, 0x1F6FF)  ||
    # basic block - control characters
    e.between?(0x0000, 0x001F) ||
    # Private Use Area
    e.between?(0xE000, 0xF8FF)
  }
  .pack('U*')

puts clean_xml_str

See sandbox on repl.it for more information.

the Tin Man
Patrick Ferreira

1 Answer

You're asking Nokogiri to do something that isn't really its job. Nokogiri is supposed to parse valid XML, and those characters appear to be valid. In these situations we're forced to pre-process the file, then hand it off. The same thing happens with pathologically damaged XML or HTML: it's dirty and we feel dirty doing it, but it's entirely acceptable, and easier than jumping through hoops after the fact.

I'd use a pattern, or a couple, to remove any characters outside the normal ASCII range, or whatever range you deem acceptable, before passing the XML on to Nokogiri. As a quick and dirty example, this strips anything outside the printable ASCII range, but you'll want to fine-tune it because it also removes the ê:

'<a> grêve &#55357;&#56628; SNCF  ➡️</a>'.gsub(/[^\x20-\x7e]+/, '')
# => "<a> grve &#55357;&#56628; SNCF  </a>"

or:

'<a> grêve &#55357;&#56628; SNCF  ➡️</a>'.gsub(/[^[:ascii:]]+/, '')
# => "<a> grve &#55357;&#56628; SNCF  </a>"
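
One way to fine-tune the range so that ê survives is to widen the allowed set beyond ASCII. The sketch below is only an illustration: the helper name `strip_unwanted_chars` is hypothetical, and the choice of "acceptable" (tab, newline, carriage return, printable ASCII, and the Latin blocks U+00A0 to U+024F) is an assumption you would adjust for your own data.

```ruby
# Hypothetical helper, not part of Ruby or Nokogiri.
# Assumption: acceptable characters are tab/LF/CR, printable ASCII, and
# Latin-1 Supplement through Latin Extended-B (U+00A0..U+024F).
# Everything else, emoji and variation selectors included, is removed.
def strip_unwanted_chars(str)
  str.gsub(/[^\t\n\r\u0020-\u007E\u00A0-\u024F]+/, '')
end

puts strip_unwanted_chars('<a> grêve &#55357;&#56628; SNCF  ➡️</a>')
```

With this pattern the ê is kept and the ➡️ is dropped, while the numeric references `&#55357;&#56628;` are left alone because they consist of plain ASCII characters.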

Simply add the encoded strings (the numeric character references such as &#55357;&#56628;) into the patterns, or run a second pass to handle them. Ruby's Regexp documentation will help you fine-tune it.
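
A second pass over the numeric references could look roughly like the following. This is a sketch under stated assumptions: `keep_code_point?` and `strip_char_refs` are hypothetical names, and the policy (keep tab, newline, carriage return, printable ASCII, and U+00A0 to U+024F) is only an example of a range you might deem acceptable.

```ruby
# Hypothetical policy: which code points may stay in the document.
def keep_code_point?(code)
  [0x09, 0x0A, 0x0D].include?(code) ||
    code.between?(0x20, 0x7E) ||
    code.between?(0xA0, 0x24F)
end

# Hypothetical helper: removes decimal (&#55357;) and hexadecimal (&#x1F534;)
# character references whose code point fails the policy above.
# Named entities such as &amp; do not match the pattern and are untouched.
def strip_char_refs(str)
  str.gsub(/&#(?:x(\h+)|(\d+));/) do |ref|
    code = $1 ? $1.hex : $2.to_i
    keep_code_point?(code) ? ref : ''
  end
end

puts strip_char_refs('<a> gr&#xEA;ve &#55357;&#56628; SNCF &#x1F534;</a>')
```

Because the block decodes each reference before deciding, the surrogate halves `&#55357;&#56628;` and the `&#x1F534;` reference are dropped, while `&#xEA;` (ê) is kept.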

As for the solution in "How do I remove emoji from string", it will also work, but it's going to be slower because it iterates through every character. gsub with a pattern hands the work to Ruby's regular expression engine, which, if you pass it the entire XML file, will run much faster.
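
If you want to check the speed difference on your own data, a rough comparison can be set up with Ruby's standard Benchmark module. The sample string and both method names below are made up for illustration, and the timings will vary with input and Ruby version, so treat this as a sketch and not as a measurement.

```ruby
require 'benchmark'

# Per-character approach: unpack to code points, filter in Ruby, repack.
def strip_by_codepoint(str)
  str.unpack('U*').reject { |cp| cp > 0x7F }.pack('U*')
end

# Pattern approach: a single gsub call handled by the regexp engine.
def strip_by_pattern(str)
  str.gsub(/[^[:ascii:]]+/, '')
end

# Hypothetical sample input: one small element repeated many times.
sample = "<a> gr\u00EAve \u{1F534} SNCF \u27A1\uFE0F</a>\n" * 10_000

Benchmark.bm(12) do |bm|
  bm.report('per-char:') { strip_by_codepoint(sample) }
  bm.report('gsub:')     { strip_by_pattern(sample) }
end
```

Both methods remove the same characters here (everything above U+007F), so only the time they take differs.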

the Tin Man
  • Thanks for that detailed response, it makes sense, I'll test all this soon ( and try to remember to follow up the results here :) ) – Patrick Ferreira Dec 09 '19 at 19:41
  • If, by "follow up the results", you mean summarizing or showing what you did, please don't. There are rare occasions when it's helpful, but in general it confuses others looking for a similar solution. Otherwise see https://stackoverflow.com/help/someone-answers. – the Tin Man Dec 09 '19 at 20:49