I am reading a document which may contain XML entities like `&#160;`.

Since I need to export a plain-text file, I have to convert the XML entities to text manually.

This is what I do at the moment:

reader = new BufferedReader(new InputStreamReader(is, "utf-8"));
while ((s = reader.readLine()) != null) {
    // One hard-coded check per entity, which is what I want to avoid
    if (s.equals("&#160;"))
        s = " ";
}

There are many XML entities, and I want to convert them all to text (for example, `&#160;` to a space) without writing a long chain of if statements. Is there a generic way to do it?

skaffman
Dejell

2 Answers

Once you have extracted the number from `&#160;`, you can do this:

new String(new byte[]{(byte) 160}, "ISO-8859-1")

Here are the entity mappings: HTML ISO-8859-1 Reference
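
The byte-array trick above only covers values up to 255, where ISO-8859-1 bytes and Unicode code points coincide. A more general approach can be sketched with a regular expression that finds every decimal or hexadecimal numeric character reference and replaces it with the corresponding code point. The class name `NumericEntityDecoder` and its `decode` method are hypothetical names chosen for this illustration; they are not part of the JDK or of the original answer. Note that `&#160;` decodes to a no-break space (U+00A0), not an ordinary space.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the JDK or the original answer.
class NumericEntityDecoder {
    // Matches decimal (&#160;) and hexadecimal (&#xA0;) character references.
    private static final Pattern NUMERIC_REF =
            Pattern.compile("&#(?:[xX]([0-9a-fA-F]{1,6})|([0-9]{1,7}));");

    static String decode(String input) {
        Matcher m = NUMERIC_REF.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Group 1 holds hex digits, group 2 holds decimal digits.
            int codePoint = m.group(1) != null
                    ? Integer.parseInt(m.group(1), 16)
                    : Integer.parseInt(m.group(2), 10);
            // Leave references outside the Unicode range untouched.
            String replacement = Character.isValidCodePoint(codePoint)
                    ? new String(Character.toChars(codePoint))
                    : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // Decode, then map the no-break space to a plain space for text export.
        String decoded = decode("a&#160;b").replace('\u00A0', ' ');
        System.out.println(decoded);
    }
}
```

This sketch handles numeric references only. Named entities such as `&amp;` or `&lt;` pass through unchanged and would need a separate lookup table.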

padis

I believe what you're talking about is called HTML (not XML) decoding. There is a `URLDecoder` class which does this for URLs (which may be what you're decoding). There is also a more general class in Apache Commons for HTML decoding (specified in this question).

Edit: I was unaware of the difference between HTML and XML escapes/entities; thanks for the clarification. It appears from this question that Apache Commons has a library for decoding XML entities, but the standard Java library does not.
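
Although the JDK has no dedicated unescape utility, one possible workaround is to let the bundled XML parser do the decoding: wrap the escaped text in a dummy root element, parse it, and read back the text content. The sketch below assumes the input is well-formed XML content (a bare `&`, a stray `<`, or an undeclared entity such as `&nbsp;` makes the parser fail). The class name `XmlTextDecoder` and the wrapper element `r` are hypothetical choices for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; relies only on the JDK's built-in XML parser.
class XmlTextDecoder {
    static String decode(String escaped) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Wrap the fragment in a dummy root element so it parses as a document.
            Document doc = builder.parse(
                    new InputSource(new StringReader("<r>" + escaped + "</r>")));
            // The parser has already resolved numeric and predefined entities.
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            // Input that is not well-formed XML content ends up here.
            throw new IllegalArgumentException("Not well-formed XML content", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(decode("1 &lt; 2 &amp;&amp; 3 &gt; 2"));
    }
}
```

Compared with a hand-written replacement, this also resolves the five predefined XML entities (`&amp;`, `&lt;`, `&gt;`, `&quot;`, `&apos;`), at the cost of rejecting any input the parser considers malformed.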

Community
Pace
    I am actually looking for XML decoding. `&#160;` is an XML entity and not an HTML one, which would be `&nbsp;`. – Dejell Feb 01 '11 at 21:53
    URL decoding changes `%20` to a space; entity decoding changes `&#32;` or `&#x20;` to a space, or `&#160;` to a non-breaking space. "Numeric character entity references" are valid in both XML and HTML, not XML only. – Stephen P Feb 02 '11 at 00:33