I'm using Woodstox to process an XML document that contains some entities (most notably &gt;) in the value of one of the nodes. To use an extreme example, it's something like this:

<parent>&nbsp; &lt; &nbsp; &gt; &amp; &quot; &apos; &nbsp;</parent>

I have tried a lot of different configuration options for both WstxInputFactory (IS_REPLACING_ENTITY_REFERENCES, P_TREAT_CHAR_REFS_AS_ENTS, P_CUSTOM_INTERNAL_ENTITIES...) and WstxOutputFactory, but no matter what I try, the output is always something like this:

<parent>nbsp; &lt; nbsp; > &amp; " ' nbsp;</parent>

(&gt; gets converted to >, &lt; stays the same, &nbsp; loses the &...)

I'm reading the XML with an XMLEventReader created with

XMLEventReader reader = wstxInputFactory.createXMLEventReader(new StringReader(fulltext));

after configuring the WstxInputFactory.

Is there any way to configure Woodstox to just ignore all entities and output the text exactly as it was in the input String?

luthier
  • Were you able to resolve this issue? I'm facing a similar problem where &lt; stays the same whereas &gt; gets converted to > – Buzz Aug 25 '20 at 19:28
  • @Buzz I ended up doing something *really* hacky that I'm not very proud of, but it got the job done: before processing the XML, I replace all `&gt;` (and `&apos;` and `&quot;`) in the input XML with something like `@@@HACKY_REPLACEMENT_FOR_GT@@@`, and then replace it back once the processing is done. It's probably the least elegant/efficient solution ever, but I just couldn't spend any more time on it. Hope this helps! :) – luthier Aug 27 '20 at 06:57
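
The placeholder workaround described in the comment above could be sketched roughly as follows. This is only an illustration under assumptions, not the commenter's actual code: the class name `EntityPlaceholders`, the marker strings other than the one quoted above, and the `roundTrip` helper are all hypothetical, and the processing step is passed in as a function so that the sketch does not depend on any particular parser setup.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical helper: hides selected entity references behind markers
// before parsing and puts them back after the XML has been written out.
final class EntityPlaceholders {
    // Entity reference -> marker. The markers are assumptions; any string
    // that cannot occur in the real input and contains no markup would do.
    private static final Map<String, String> MARKERS = new LinkedHashMap<>();

    static {
        MARKERS.put("&gt;", "@@@HACKY_REPLACEMENT_FOR_GT@@@");
        MARKERS.put("&apos;", "@@@HACKY_REPLACEMENT_FOR_APOS@@@");
        MARKERS.put("&quot;", "@@@HACKY_REPLACEMENT_FOR_QUOT@@@");
    }

    // Replace each protected entity reference with its marker.
    static String protect(String xml) {
        String result = xml;
        for (Map.Entry<String, String> e : MARKERS.entrySet()) {
            result = result.replace(e.getKey(), e.getValue());
        }
        return result;
    }

    // Turn the markers back into the original entity references.
    static String restore(String xml) {
        String result = xml;
        for (Map.Entry<String, String> e : MARKERS.entrySet()) {
            result = result.replace(e.getValue(), e.getKey());
        }
        return result;
    }

    // Protect, run the caller-supplied processing step, then restore.
    static String roundTrip(String xml, UnaryOperator<String> process) {
        return restore(process.apply(protect(xml)));
    }
}
```

The `process` argument stands in for whatever reads the XML with the input factory and writes it back with the output factory. Because the markers are plain text, the parser passes them through untouched, which is what makes the hack work.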

2 Answers

0

The five predefined XML entities (quot, amp, apos, lt, gt) will always be expanded. As far as I know, there is no way to get their original source text with StAX.

For the other entities you can process them manually. You can capture the events until the end of the element and concatenate the values:

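    import com.ctc.wstx.stax.WstxInputFactory;
    import javax.xml.stream.XMLEventReader;
    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.events.EntityReference;
    import javax.xml.stream.events.XMLEvent;

    // Instantiate Woodstox directly so the factory is guaranteed to be the Woodstox one
    XMLInputFactory factory = new WstxInputFactory();
    // Report undeclared entities such as &nbsp; as events instead of expanding them
    factory.setProperty(XMLInputFactory.IS_REPLACING_ENTITY_REFERENCES, Boolean.FALSE);
    XMLEventReader xmlr = factory.createXMLEventReader(
            this.getClass().getResourceAsStream(xmlFileName));

    StringBuilder value = new StringBuilder();
    while (xmlr.hasNext()) {
        XMLEvent event = xmlr.nextEvent();
        if (event.isCharacters()) {
            // Predefined entities (&lt;, &gt;, ...) arrive here already expanded
            value.append(event.asCharacters().getData());
        } else if (event.isEntityReference()) {
            // Rebuild the original reference text from the entity name
            value.append('&').append(((EntityReference) event).getName()).append(';');
        } else if (event.isEndElement()) {
            // Assign it to the right variable
            System.out.println(value);
            value.setLength(0);
        }
    }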

For your example input:

<parent>&nbsp; &lt; &nbsp; &gt; &amp; &quot; &apos; &nbsp;</parent>

The output will be:

&nbsp; < &nbsp; > & " ' &nbsp;

Otherwise, if you want to expand all the entities, you could use a custom XMLResolver for the undeclared ones:

public class NaiveHtmlEntityResolver implements XMLResolver {

    private static final Map<String, String> ENTITIES = new HashMap<>();

    static {
        ENTITIES.put("nbsp", " ");
        ENTITIES.put("apos", "'");
        ENTITIES.put("quot", "\"");
        // and so on
    }

    @Override
    public Object resolveEntity(String publicID,
            String systemID,
            String baseURI,
            String namespace) throws XMLStreamException {
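        // For undeclared entities the entity name arrives in the
        // 'namespace' argument, hence the lookup by that parameter
        if (publicID == null && systemID == null) {
            return ENTITIES.get(namespace);
        }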
        return null;
    }
}

And then tell Woodstox to use it for the undeclared entities:

    factory.setProperty(WstxInputProperties.P_UNDECLARED_ENTITY_RESOLVER, new NaiveHtmlEntityResolver());
  • Thanks, Iñaki! What I'm trying to do is keep the `&gt;` as it is instead of having it expanded, but at the very least this is a great starting point :) – luthier Feb 14 '19 at 18:33
0

First of all, you need to include actual code, since "output is always something like this" makes no sense without explaining exactly how you are outputting the parsed content: you may be printing events, using some library, or perhaps using a Woodstox stream or event writer.

Second: XML distinguishes between the small number of predefined entities (lt, gt, apos, quot, amp) and arbitrary user-defined entities, which is what nbsp would be here. The former you can use as-is, since they are already defined; the latter only exist if you define them in a DTD.

Handling of the two groups is different, too. The former will always be expanded no matter what, as required by the XML specification. The latter will be resolved (unless resolution is disabled) and then expanded, or an exception will be thrown if they are not defined. You can also specify a custom resolver, as mentioned in the other answer, but it will only be used for custom entities (here, &nbsp;).
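
To illustrate the first half of that point, here is a minimal sketch that reads an element containing only the five predefined entities while entity replacement is switched off. As an assumption made to keep the example free of extra dependencies, it uses whichever StAX implementation `XMLInputFactory.newFactory()` finds (normally the one bundled with the JDK) rather than Woodstox specifically; the class and method names are hypothetical.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical demo class: shows that turning off entity replacement
// does not stop the predefined entities from being expanded.
final class PredefinedEntityDemo {

    // Returns the text content of the root element of the given XML.
    static String readText(String xml) {
        try {
            XMLInputFactory factory = XMLInputFactory.newFactory();
            // Only affects entities declared in a DTD (or undeclared ones),
            // not lt, gt, amp, quot and apos
            factory.setProperty(XMLInputFactory.IS_REPLACING_ENTITY_REFERENCES, Boolean.FALSE);
            // Deliver adjacent text as a single chunk
            factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);

            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            try {
                reader.nextTag();               // advance to the root start tag
                return reader.getElementText(); // collect its text content
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }
}
```

Calling `PredefinedEntityDemo.readText("<parent>&lt; &gt; &amp; &quot; &apos;</parent>")` should return the already expanded text `< > & " '`; the application never gets to see the original references.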

Finally, it is better to explain what you are trying to achieve rather than just what you are doing. That makes it easier to suggest something useful than a specific "how do I do X" question does, since X may not be the right way to go about it.

And as to configuration of Woodstox, maybe this blog entry:

https://medium.com/@cowtowncoder/configuring-woodstox-xml-parser-woodstox-specific-properties-1ce5030a5173

will help (as well as 2 others in the series) -- it covers existing configuration settings.

StaxMan
  • Thank you for your answer! And apologies if my question was not as clear as it should have been. In summary: what I'm trying to achieve is that, for an input that contains, say, `&gt;`, after passing through a WstxInputFactory and a WstxOutputFactory, the output contains exactly the same `&gt;`, without it being expanded to `>`. Do you think it can be done? (thanks again!) – luthier Feb 14 '19 at 18:29
  • @LuTHieR I don't think it is possible exactly like that. Woodstox does offer character offsets for events, so it may be possible to figure out something that works, but there is no way to prevent resolution itself. Note that a few other things in the input cannot be preserved either: whitespace around attributes, the original linefeeds (they are normalized), and character references (& followed by # and a character code, then a semicolon). – StaxMan Feb 19 '19 at 01:19