I am trying to parse XML files that contain special characters such as &, ", ', < or > in the data. I would like to know how to do this properly.

NB: the files are very large and I have no control over how they are produced, so I cannot modify them at the source. I am therefore looking for an automated way to transform each file before parsing it, for example with regular expressions or a similar technique.

halfer

1 Answer

Are these well-formed XML files or ill-formed XML files?

If they are ill-formed then you can't use an XML parser to process them. You will need to determine exactly how the data format differs from well-formed XML and write custom parsing code to handle the precise situations that occur in your data.
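To answer the first question for a large file, one option is to stream it through a parser and record where it first fails. The following is a minimal sketch in Python, assuming the standard library's `xml.etree.ElementTree` is acceptable; the helper name `check_well_formed` is hypothetical and not part of any library.

```python
import io
import xml.etree.ElementTree as ET


def check_well_formed(source):
    """Return (True, None) if source parses as XML,
    otherwise (False, (line, column)) of the first error.

    source may be a file name or a binary file object.
    """
    try:
        # iterparse streams the input, so a huge file is not held in memory.
        for _event, elem in ET.iterparse(source, events=("end",)):
            elem.clear()  # discard element content we no longer need
    except ET.ParseError as err:
        return False, err.position
    return True, None


# Illustrative inputs (assumptions, not taken from the real files).
good = io.BytesIO(b"<a>x &amp; y</a>")   # ampersand escaped correctly
bad = io.BytesIO(b"<a>x & y</a>")        # bare ampersand in the data

good_result = check_well_formed(good)
bad_result = check_well_formed(bad)
```

If the check reports an error, the returned position is a starting point for working out how the data deviates from well-formed XML.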

This can be difficult. For example, if your file contains

<expr>a<b<c</expr>

then working out which < symbols are markup and which are data requires some serious analysis (or guesswork). And in the general case, the task is impossible.

Of course standards are there for a reason and it's far better if the person creating the data files reads the spec and follows it. That's the only way to do it "properly".
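For illustration, this is roughly what following the spec looks like on the producing side. It is a sketch in Python using the standard library's `xml.sax.saxutils.escape`, assuming the producer builds the document by string concatenation; the variable names are hypothetical.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

data = "a<b<c"  # the character data from the example above

# Unescaped: the parser cannot tell data from markup and rejects the input.
raw_doc = "<expr>" + data + "</expr>"
try:
    ET.fromstring(raw_doc)
    raw_parses = True
except ET.ParseError:
    raw_parses = False

# Escaped by the producer: & < > in the data become entity references.
escaped_doc = "<expr>" + escape(data) + "</expr>"
round_tripped = ET.fromstring(escaped_doc).text
```

With the data escaped at the source, any conforming parser recovers the original text and no pre-processing step is needed.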

Michael Kay