This is a follow-up question to "How to pretty print XML from the command line?".

Is there any tool in libxml2 that will allow me to align the attributes of each node as well? I have a large XML document whose logical structure I cannot change, but I would like to turn

<a attr="one" bttr="two" tttr="three" fttr="four"/>

into

<a attr   = "one"
   bttr   = "two"
   tttr   = "three"
   fttr   = "four"
   longer = "attribute" />
Sean Allred

2 Answers

xmllint has an option --pretty which supports three levels of prettiness. If this output:

<?xml version="1.0"?>
<a
    attr="one"
    bttr="two"
    tttr="three"
    fttr="four"
/>

is OK for you, then use --pretty 2:

xmllint --pretty 2 - <<< '<a attr="one" bttr="two" tttr="three" fttr="four"/>'
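
As the comments below show, older builds (libxml version 20706 in one report) do not have this option. A minimal sketch of a guard, assuming that a build which supports the option lists --pretty in its --help text; the function names xmllint_has_pretty and pretty_attrs are made up for this example:

```shell
#!/bin/sh
# Succeed only if an xmllint is installed and its usage text mentions --pretty.
# Assumption: builds that support the option list it in the --help output.
xmllint_has_pretty() {
    command -v xmllint >/dev/null 2>&1 || return 1
    xmllint --help 2>&1 | grep -q -e '--pretty'
}

# Pretty-print FILE with one attribute per line, or fail with a hint.
pretty_attrs() {
    if xmllint_has_pretty; then
        xmllint --pretty 2 "$1"
    else
        echo "this xmllint has no --pretty option; try another tool" >&2
        return 1
    fi
}
```

Call it as `pretty_attrs in.xml`, where in.xml is whatever file you want to format.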
hek2mgl
  • My `xmllint` has no such option... what version do you have? I'm using `libxml version 20706` – Sean Allred Sep 17 '14 at 18:25
  • I'm using `xmllint: using libxml version 20901 ` – hek2mgl Sep 17 '14 at 18:26
  • `:(` Therein lies the problem, I suppose. My copy was last packaged `2013-01-30 14:59`... sigh. – Sean Allred Sep 17 '14 at 18:33
  • Several years later, and while this is the best answer I've found, it's still pretty broken. While it does pretty well with the attributes, it completely uglifies the rest of the elements: `xmllint --pretty 2 - <<< 'something'` is horrible. – rbellamy Feb 27 '16 at 02:13
  • @rbellamy I see. Looks weird! :) I guess the best thing you can do in that case is writing something on your own.. (or modify existing prettifiers) – hek2mgl Feb 27 '16 at 10:30
Try xml_pp with style "-s cvs"

You asked for something in libxml2. I don't know about that. But if you are willing to use something else, then read on below.

xml_pp is part of the XML::Twig library and has a bunch of different preconfigured styles.

You can specify a style via the "-s" (style) parameter.

If you just leave "-s" empty, then it will show all available styles. (It actually generates that list on the fly, so it's guaranteed to be fresh.)

$ xml_pp -s
Use of uninitialized value $opt{"style"} in hash element at /usr/bin/xml_pp line 100.
usage: /usr/bin/xml_pp [-v] [-i<extension>] [-s (none|nsgmls|nice|indented|indented_close_tag|indented_c|wrapped|record_c|record|cvs|indented_a)] [-p <tag(s)>] [-e <encoding>] [-l] [-f <file>] [<files>] at /usr/bin/xml_pp line 100.

Here's the same thing again but in a nicer list format. It turns out that the version I have installed supports 11 formats out of the box:

$ xml_pp -s 2>&1 | grep -Po '(?<=\[-s \()[^)]*' | tr '|' '\n' | nl
     1  none
     2  nsgmls
     3  nice
     4  indented
     5  indented_close_tag
     6  indented_c
     7  wrapped
     8  record_c
     9  record
    10  cvs
    11  indented_a

So let's try them all.

This is our input file:

$ cat in.xml
<a attr="one" bttr="two" tttr="three" fttr="four"/>

And these are all the styles:

$ for STYLE in none nsgmls nice indented indented_close_tag indented_c wrapped record_c record cvs indented_a; do echo; echo "==> Style: xml_pp -s $STYLE <=="; xml_pp -s "$STYLE" < in.xml | tee "out.xml_pp.$STYLE.xml"; echo; done

==> Style: xml_pp -s none <==
<a attr="one" bttr="two" fttr="four" tttr="three"/>

==> Style: xml_pp -s nsgmls <==
<a
attr="one"
bttr="two"
fttr="four"
tttr="three"
/>

==> Style: xml_pp -s nice <==
<a attr="one" bttr="two" fttr="four" tttr="three"/>

==> Style: xml_pp -s indented <==
<a attr="one" bttr="two" fttr="four" tttr="three"/>

==> Style: xml_pp -s indented_close_tag <==
<a attr="one" bttr="two" fttr="four" tttr="three"/>

==> Style: xml_pp -s indented_c <==
<a attr="one" bttr="two" fttr="four" tttr="three"/>

==> Style: xml_pp -s wrapped <==
<a attr="one" bttr="two" fttr="four" tttr="three"/>

==> Style: xml_pp -s record_c <==

<a attr="one" bttr="two" fttr="four" tttr="three"/>

==> Style: xml_pp -s record <==

<a attr="one" bttr="two" fttr="four" tttr="three"/>

==> Style: xml_pp -s cvs <==
<a
    attr="one"
    bttr="two"
    fttr="four"
    tttr="three"
/>

==> Style: xml_pp -s indented_a <==
<a
    attr="one"
    bttr="two"
    fttr="four"
    tttr="three"
/>

A bunch of these styles are equivalent for this small input file. They produce the same output:

$ sha256sum * | sort
452f5c19177d9cc6a54589168dbb1ee790c783a963110662e7dfae170bf997e4  out.xml_pp.cvs.xml
452f5c19177d9cc6a54589168dbb1ee790c783a963110662e7dfae170bf997e4  out.xml_pp.indented_a.xml
8e119bb50bcbf3d72159c96139cf328f46a0de259410acdd344f26e52f033996  out.xml_pp.nsgmls.xml
d1ed9a4d1ebf8b9f1d012577809909e91e1ba0fc01b5afc8ff1302ca9dced617  out.xml_pp.record_c.xml
d1ed9a4d1ebf8b9f1d012577809909e91e1ba0fc01b5afc8ff1302ca9dced617  out.xml_pp.record.xml
e0d13f80ddc48876678c62e407abd3ab1eac8481a82d5aabb1514e24aee4717c  in.xml
ea90003eab0ba71936a8a329a87b079b4fb120fe6873d4fa9bc8f986e8654b45  out.xml_pp.indented_close_tag.xml
ea90003eab0ba71936a8a329a87b079b4fb120fe6873d4fa9bc8f986e8654b45  out.xml_pp.indented_c.xml
ea90003eab0ba71936a8a329a87b079b4fb120fe6873d4fa9bc8f986e8654b45  out.xml_pp.indented.xml
ea90003eab0ba71936a8a329a87b079b4fb120fe6873d4fa9bc8f986e8654b45  out.xml_pp.nice.xml
ea90003eab0ba71936a8a329a87b079b4fb120fe6873d4fa9bc8f986e8654b45  out.xml_pp.none.xml
ea90003eab0ba71936a8a329a87b079b4fb120fe6873d4fa9bc8f986e8654b45  out.xml_pp.wrapped.xml

None of these styles is exactly what you wanted.

But "cvs" is pretty close. (And "indented_a" produces identical output.)
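
If you want the "=" signs lined up as well, the one-attribute-per-line output can be post-processed. Here is a rough sketch with awk. It is a plain text filter, not an XML parser, and it assumes each attribute sits on its own indented line with a double-quoted value that does not span lines. The name align_attrs is made up for this example. Note that the attributes still start on the line after the tag name, so this is closer to the requested layout but not identical to it.

```shell
#!/bin/sh
# align_attrs: pad attribute names so that the "=" signs line up.
# Reads XML on stdin with one attribute per line, writes the result to stdout.
align_attrs() {
    awk '
    # Print the buffered attribute lines, padded to the longest name.
    function emit(   i) {
        for (i = 1; i <= n; i++)
            printf "%s%-" width "s = %s\n", indent[i], name[i], value[i]
        n = 0; width = 0
    }
    # An attribute-only line: indentation, name, "=", double-quoted value.
    /^[ \t]+[^ \t=<>]+="[^"]*"[ \t]*$/ {
        n++
        match($0, /^[ \t]+/)
        indent[n] = substr($0, 1, RLENGTH)
        rest = substr($0, RLENGTH + 1)
        eq = index(rest, "=")
        name[n] = substr(rest, 1, eq - 1)
        value[n] = substr(rest, eq + 1)
        if (length(name[n]) > width) width = length(name[n])
        next
    }
    # Any other line ends the current run of attributes.
    { emit(); print }
    END { emit() }
    '
}
```

Usage would be something like `xml_pp -s cvs < in.xml | align_attrs`. Whitespace around "=" is allowed inside a tag, so the result is still well-formed XML.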

Afterthoughts: the output feels a little dirty

(a) Some of the files just start with a blank line for no good reason...

$ grep -n '^$' *
out.xml_pp.record_c.xml:1:
out.xml_pp.record.xml:1:

(b) ... and some of the files just have no line terminators at all:

$ file *
in.xml:                            ASCII text
out.xml_pp.cvs.xml:                ASCII text
out.xml_pp.indented_a.xml:         ASCII text
out.xml_pp.indented_close_tag.xml: ASCII text, with no line terminators
out.xml_pp.indented_c.xml:         ASCII text, with no line terminators
out.xml_pp.indented.xml:           ASCII text, with no line terminators
out.xml_pp.nice.xml:               ASCII text, with no line terminators
out.xml_pp.none.xml:               ASCII text, with no line terminators
out.xml_pp.nsgmls.xml:             ASCII text
out.xml_pp.record_c.xml:           ASCII text
out.xml_pp.record.xml:             ASCII text
out.xml_pp.wrapped.xml:            ASCII text, with no line terminators

It seems that xml_pp does not add a trailing newline after the last line. So if the output is only ONE line, there is no newline byte in the file at all. Quite weird.

Looks like this:

$ wc --lines *
  5 out.xml_pp.cvs.xml
  5 out.xml_pp.indented_a.xml
  0 out.xml_pp.indented_close_tag.xml
  0 out.xml_pp.indented_c.xml
  0 out.xml_pp.indented.xml
  0 out.xml_pp.nice.xml
  0 out.xml_pp.none.xml
  5 out.xml_pp.nsgmls.xml
  1 out.xml_pp.record_c.xml
  1 out.xml_pp.record.xml
  0 out.xml_pp.wrapped.xml
 17 total

This is how I like to add a trailing LF (byte 0x0A) if none is present:

$ mkdir 1; mv out.*.xml 1/; cp -r 1/ 2/

$ pcregrep -LMr '\n\Z' 2/ | xargs -n1 --no-run-if-empty -- sed -i -e '$a\' --

$ diff --recursive 1/ 2/ | head
diff --recursive 1/out.xml_pp.cvs.xml 2/out.xml_pp.cvs.xml
6c6
< />
\ No newline at end of file
---
> />
diff --recursive 1/out.xml_pp.indented_a.xml 2/out.xml_pp.indented_a.xml
6c6
< />
\ No newline at end of file

Looks like this afterwards:

$ cd 2/

$ wc --lines *
  6 out.xml_pp.cvs.xml
  6 out.xml_pp.indented_a.xml
  1 out.xml_pp.indented_close_tag.xml
  1 out.xml_pp.indented_c.xml
  1 out.xml_pp.indented.xml
  1 out.xml_pp.nice.xml
  1 out.xml_pp.none.xml
  6 out.xml_pp.nsgmls.xml
  2 out.xml_pp.record_c.xml
  2 out.xml_pp.record.xml
  1 out.xml_pp.wrapped.xml
 28 total
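
If pcregrep is not available, the same fix can be sketched with POSIX tools only. This is an alternative to the pipeline above, not what produced the listings here; the function name ensure_trailing_newline is made up for this example, and it is meant for text files.

```shell
#!/bin/sh
# ensure_trailing_newline FILE...
# Append a single LF to each non-empty file whose last byte is not LF.
ensure_trailing_newline() {
    for f in "$@"; do
        # Skip missing or empty files.
        [ -s "$f" ] || continue
        # Command substitution strips a trailing LF, so the result is
        # empty when the last byte of a text file is already a newline.
        if [ -n "$(tail -c 1 -- "$f")" ]; then
            printf '\n' >> "$f"
        fi
    done
}
```

For example: `ensure_trailing_newline 2/*.xml`. Files that already end in a newline are left untouched, so it is safe to run more than once.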
StackzOfZtuff