I am trying to collapse a formatted XML document onto a single line (using Java).

I tried the following regex replacement:

input.replaceAll(">\\s+", ">").replaceAll("\\s+<", "<");

However, it also removes the whitespace at the start and end of an element's text content, which I do not want.

For example:

Scenario 01

Before: <AAA>{space}{space}{space}</AAA>

After: <AAA></AAA>

Scenario 02

Before: <AAA>{space}{space}123{space}{space}</AAA>

After: <AAA>123</AAA>

Scenario 03

Before: <AAA>{space}A{space}B{space}C{space}</AAA>

After: <AAA>A{space}B{space}C</AAA>
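The three scenarios can be reproduced with a small sketch like the one below. The class name `StripDemo` and the method name `strip` are made up for illustration; the replacement itself is the one shown above.

```java
final class StripDemo {
    // The replacement from the question: drops every run of whitespace
    // that follows '>' or precedes '<', including whitespace in text content.
    static String strip(String input) {
        return input.replaceAll(">\\s+", ">").replaceAll("\\s+<", "<");
    }

    public static void main(String[] args) {
        System.out.println(strip("<AAA>   </AAA>"));      // Scenario 01
        System.out.println(strip("<AAA>  123  </AAA>"));  // Scenario 02
        System.out.println(strip("<AAA> A B C </AAA>"));  // Scenario 03
    }
}
```

Each call returns the corresponding "After" value listed above.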

Is there any way to unformat the XML without running into the scenarios above?

obl0702

2 Answers

A Saxon solution:

import java.io.File;
import net.sf.saxon.s9api.*;

Processor p = new Processor(false);
DocumentBuilder db = p.newDocumentBuilder();
// Strip whitespace-only text nodes while the tree is built
db.setWhitespaceStrippingPolicy(WhitespaceStrippingPolicy.ALL);
XdmNode doc = db.build(new File(...));
Serializer s = p.newSerializer(new File(...));
s.serialize(doc.asSource());

This gives you quite a lot of control over the format of the output by setting properties on the Serializer object.

Michael Kay

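This removes only vertical whitespace (line breaks such as "\n", "\r", or combinations of them, plus a few other characters) that directly follows the end of a tag or directly precedes the start of one. Spaces and tabs are left untouched, so whitespace inside an element's text is kept, but so is any indentation in front of a tag. In Java, \v has this meaning from Java 8 onwards.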

input.replaceAll(">\\v+", ">").replaceAll("\\v+<", "<");

Excerpt from https://www.regular-expressions.info/shorthand.html says:

\v matches “vertical whitespace”, which includes all characters treated as line breaks in the Unicode standard. It is the same as [\n\cK\f\r\x85\x{2028}\x{2029}].
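If the indentation between tags should go as well, one possible variation is to remove whitespace between two tags only when that run contains a line break. The sketch below is an illustration and not part of the answer above: it assumes Java 8 or later (for `\h` and `\v`), and the class name `XmlFlattener` and method name `flatten` are hypothetical. Because it is plain regex and not XML-aware, it would also rewrite matching sequences inside comments or CDATA sections.

```java
import java.util.regex.Pattern;

final class XmlFlattener {
    // '>' followed by optional horizontal whitespace, at least one line break,
    // then any further whitespace, up to the next '<'.
    // Whitespace runs without a line break are not matched and stay as they are.
    private static final Pattern BETWEEN_TAGS = Pattern.compile(">\\h*\\v\\s*<");

    static String flatten(String xml) {
        return BETWEEN_TAGS.matcher(xml).replaceAll("><");
    }

    public static void main(String[] args) {
        String input = "<root>\r\n"
                + "    <AAA>   </AAA>\n"
                + "    <BBB>  123  </BBB>\n"
                + "    <CCC> A B C </CCC>\n"
                + "</root>";
        // prints <root><AAA>   </AAA><BBB>  123  </BBB><CCC> A B C </CCC></root>
        System.out.println(flatten(input));
    }
}
```

Text content that itself ends with a line break before a closing tag (for example "abc\n</AAA>") is not touched by this pattern, since the match has to start at a '>'.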

TreffnonX