I have some badly formatted XML that I must parse. Fixing the problem upstream is not possible.
The (current) problem is that ampersand characters are not always escaped properly, so I need to convert & into &amp;. If &amp; is already there, I don't want to change it to &amp;amp;. In general, if any well-formed entity is already there, I don't want to destroy it. I don't think it's possible, in general, to know every entity that could appear in a particular XML document, so I want a solution that preserves anything of the form &<characters>;
Here, <characters> is some set of characters defining an entity between the initial & and the closing ; character. In particular, < and > are not literals that would otherwise denote an XML element.
Now, when parsing, if I see &<characters> I don't know whether I'll run into a ; character, a space, an end-of-line, or another & character. So I think that I have to remember <characters> as I look ahead for a character that will tell me what to do with the original & character.
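To make the look-ahead concrete, here is a minimal sketch of a hand-written scan in Java. The class name AmpersandScanner, the method escape, and the helper isEntityChar are hypothetical names chosen for illustration, and the set of characters allowed inside an entity is an assumption, not something taken from the XML specification.

```java
final class AmpersandScanner {

    /** Escapes every '&' that does not begin something shaped like &characters; */
    static String escape(String input) {
        StringBuilder out = new StringBuilder(input.length() + 16);
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (c != '&') {
                out.append(c);
                i++;
                continue;
            }
            // Look ahead from the '&' without emitting anything yet.
            int j = i + 1;
            while (j < input.length() && isEntityChar(input.charAt(j))) {
                j++;
            }
            // An entity needs at least one character and a closing ';'.
            boolean entity = j > i + 1 && j < input.length() && input.charAt(j) == ';';
            if (entity) {
                out.append(input, i, j + 1); // copy the entity through unchanged
                i = j + 1;
            } else {
                out.append("&amp;");         // stray ampersand: escape it
                i++;                         // rescan from the next character
            }
        }
        return out.toString();
    }

    // Assumption: entity bodies use only letters, digits, '#', '.', '-', '_' and ':'.
    private static boolean isEntityChar(char ch) {
        return Character.isLetterOrDigit(ch)
                || ch == '#' || ch == '.' || ch == '-' || ch == '_' || ch == ':';
    }

    public static void main(String[] args) {
        System.out.println(escape("AT&T &lt;ok&gt; & done"));
    }
}
```

Under these assumptions, a space, an end-of-line, or a second & encountered before any ; ends the look-ahead, and the original & is escaped.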
I think that I need the power of a pushdown automaton to do this. I don't think that a finite state machine will work, because of what I think is a memory requirement. Is that correct? If I need a PDA, then a regular expression in a call to String.replaceAll(String, String) won't work. Or is there a Java regex that can solve this problem?
Remember: there could be multiple replacements per line.
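For concreteness, here is a sketch of the kind of String.replaceAll call in question, using a negative lookahead so that an & is replaced only when it is not followed by something shaped like an entity. The class name AmpersandRegex and the method escapeStray are hypothetical, and the pattern assumes an entity body is an optional # followed by ASCII letters or digits, which may be narrower than what real documents contain.

```java
final class AmpersandRegex {

    // '&' not followed by: optional '#', one or more ASCII letters/digits, then ';'.
    // Assumption: this is what "looks like an entity" means for the input at hand.
    private static final String STRAY_AMPERSAND = "&(?!#?[A-Za-z0-9]+;)";

    /** Replaces every stray '&' with "&amp;" and leaves entity-shaped text alone. */
    static String escapeStray(String line) {
        // replaceAll rewrites every match, so several ampersands per line are handled.
        return line.replaceAll(STRAY_AMPERSAND, "&amp;");
    }

    public static void main(String[] args) {
        System.out.println(escapeStray("a & b &amp; c & d"));
    }
}
```

The lookahead is zero-width, so only the & itself is consumed and replaced; the characters after it are inspected but left in place for the rest of the scan.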
(I'm aware of this question, but it does not provide the answer that I am looking for.)