I have a large externally generated XML file that contains some invalid characters, backslashes in my case. I know what these fields should contain, so I can open a single file in gedit and fix it manually. However, there are many of these files, all with the same problem, and I would like to write a bash script to fix them all.

Problem: The problematic section looks like this.

<root>
 <array>
  <dimension> dim="1">gridpoints</dimension>
  <field> a </field>
  <field> b </field>
  <field> c </field>
  <field> \00\00\00 </field>
  <field> \00\00\00 </field>
  <field> \00\00\00 </field>
  <set> 
   All the data 
  </set>
 </array>
</root>

Desired output

<root>
 <array>
  <dimension> dim="1">gridpoints</dimension>
  <dimension> dim="2">morepoints</dimension>
  <dimension> dim="3">evenmorepoints</dimension>
  <field> a </field>
  <field> b </field>
  <field> c </field>
  <field> d </field>
  <field> e </field>
  <field> f </field>
  <set> 
   All the data 
  </set>
 </array>
</root>

Fix so far: I have already found a way to remove the offending backslashes using perl, but I can't figure out how to edit the fields individually. The code below gets close to the desired output, but every field ends up with the entry "a".

#!/bin/bash
perl -CSDA -pe'
   s/[^\x9\xA\xD\x20-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]+//g;
' file.xml > temp.xml
xmlstarlet ed -u "/root/array/field" -v "a" temp.xml > file_fixed.xml

I will also gladly take any advice on how to do this more efficiently. Thank you.

Edit: As requested by zdim, I have added an example that is more representative of the full file I am dealing with.

<root>
 <path1>
  <array>
   <dimension> dim="1">gridpoints</dimension>
   <field> a </field>
   <field> b </field>
   <field> c </field>
   <field> \00\00\00 </field>
   <field> \00\00\00 </field>
   <field> \00\00\00 </field>
   <set> 
    All the data 
   </set>
  </array>
 </path1>
 <path2>
  <array>
   <dimension> dim="1">gridpoints</dimension>
   <field> Behaves Correctly </field>
  </array>
 </path2>
</root>

It should be noted that I receive these files as output from another program and then need to fix them before feeding them into the next one. I am nowhere near experienced with XML, which is why I may have missed some obvious solutions.

Orange Pukeko
  • [What are invalid characters in XML](https://stackoverflow.com/a/28152666/724039) shows that all characters in the range `[#x20-#xD7FF]` are valid, and a backslash is character `#x5C`. – Luuk Apr 08 '22 at 13:45

1 Answer

Use a proper XML parser.

With XML::LibXML, one way

use warnings;
use strict;
use feature 'say';

use XML::LibXML;

my $filename = shift // die "Usage: $0 file.xml\n";

my $doc = XML::LibXML->load_xml(location => $filename);

# Remove unwanted nodes
foreach my $node ($doc->findnodes('//field')) { 
    #say $node->toString;   
    if ($node->toString =~ m{\\00\\00\\00}) {
        say "Removing $node";
        $node->parentNode->removeChild($node);
    }   
}

# Add desired new nodes (right after the last <field> node)
my $last_field_node = ( $doc->findnodes('//field') )[-1];
my $field_node_name = $last_field_node->nodeName;
my $parent = $last_field_node->parentNode;

for ("E".."F") {
    my $new_elem = $doc->createElement( $field_node_name );
    $new_elem->appendText($_);
    $parent->insertAfter($new_elem, $last_field_node);
    $last_field_node = $new_elem;  # next one goes after this, to keep the order
}

# Add other nodes (like the mentioned "dimension") the same way

print $doc->toString;

I use a basic regex to recognize a node to remove, as given in the example. Please adjust the code as suitable for your actual input.

This adds new nodes after the last <field> node. But if they need to go right after the removed nodes, while there may be further <field> nodes after those, then first add them after the last <field> node that needs to be removed, and only then remove the unwanted nodes.
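
The add-first-then-remove order described above might be sketched as follows with XML::LibXML. This is only an illustration under assumptions: the inline sample, the trailing `z` field, and the replacement values `d` and `e` are made up for the demo and are not from the real files.

```perl
use warnings;
use strict;

use XML::LibXML;

# Inline sample modelled on the question (assumed); note the extra
# <field> that follows the ones to be removed
my $xml = <<'XML';
<root>
 <array>
  <field> a </field>
  <field> \00\00\00 </field>
  <field> \00\00\00 </field>
  <field> z </field>
 </array>
</root>
XML

my $doc = XML::LibXML->load_xml(string => $xml);

# Collect the nodes to drop while they can still be located
my @bad = grep { $_->textContent =~ m{\\00\\00\\00} } $doc->findnodes('//field');

if (@bad) {
    my $anchor = $bad[-1];          # last node that will be removed
    my $parent = $anchor->parentNode;

    # Hypothetical new values; each new node is inserted after the
    # previous one so that they end up in list order
    for my $text ('d', 'e') {
        my $new_elem = $doc->createElement('field');
        $new_elem->appendText($text);
        $parent->insertAfter($new_elem, $anchor);
        $anchor = $new_elem;
    }

    # Only now remove the unwanted nodes
    $_->parentNode->removeChild($_) for @bad;
}

print $doc->toString;
```

With a real file, the inline string would be replaced by `load_xml(location => $filename)` as in the code above.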

Or, perhaps you simply need to replace the content of the <field> nodes that contain '\00\00\00'

my @replacements = "AA" .. "ZZ";  # li'l list of token replacements 

foreach my $node ($doc->findnodes('//field')) { 
    if ($node->toString =~ m{\\00\\00\\00}) {
        say "Change $node -- remove child (text) nodes, add new";
        $node->removeChildNodes;
        $node->appendText(shift @replacements);
    }
}

An element's "value" is really a child text node, which in turn has a value. Instead of replacing that text node's value directly, it is better to drop all of the element's child nodes and then add the desired new text.

This code takes care of the \00\00\00 fields if they simply need to be replaced, drawing from a list of replacements. To also add <dimension> nodes, use insertAfter as above.
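
For the <dimension> nodes, a minimal sketch could look like the one below. It rests on several assumptions that the question does not settle: that `dim` is meant to be an attribute, that the new values are the ones from the "desired output", and that every <array> which already has a <dimension> should get the new ones.

```perl
use warnings;
use strict;

use XML::LibXML;

# Inline sample (assumed); dim is written as an attribute here
my $xml = <<'XML';
<root>
 <array>
  <dimension dim="1">gridpoints</dimension>
  <field> a </field>
 </array>
</root>
XML

my $doc = XML::LibXML->load_xml(string => $xml);

# Hypothetical values, taken from the "desired output" in the question
my @new_dims = ( [2, 'morepoints'], [3, 'evenmorepoints'] );

# Handle each <array> that has at least one <dimension> child
for my $array ($doc->findnodes('//array[dimension]')) {
    my $anchor = ( $array->findnodes('./dimension') )[-1];

    for my $dim (@new_dims) {
        my $elem = $doc->createElement('dimension');
        $elem->setAttribute(dim => $dim->[0]);
        $elem->appendText($dim->[1]);
        $array->insertAfter($elem, $anchor);
        $anchor = $elem;   # next one goes after the node just added
    }
}

print $doc->toString;
```

If only one particular <array> should be changed, a narrower XPath expression (for example one that starts at `/root/path1`) would be the place to restrict it.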

There are modules for prettier printing, like XML::LibXML::PrettyPrint


With Mojo::DOM, one way

use warnings;
use strict;
use feature 'say';

use Path::Tiny;  # convenience, for "slurp"-ing a file
use Mojo::DOM;

my $filename = shift // die "Usage: $0 file.xml\n";

my $dom = Mojo::DOM->new( path($filename)->slurp );
# my $dom = Mojo::DOM->new->xml(1)->parse(path($filename)->slurp);

# Remove unwanted, by filtering them first
$dom->find("field")
    -> grep( sub { $_->text =~ m{\\00\\00\\00} } )
    -> each( sub { $_[0]->remove } );

# Or directly while iterating
# $dom->find("field")->each(
#     sub { $_[0]->remove if $_[0]->text =~ m{\\00} } );

# Add new ones, after last 'field'
foreach my $content ("E".."F") {
    my $tag = $dom->new_tag('field', $content);
    $dom->find('field')->last->append($tag);
}

say $dom;

Again, please adjust to the actual document structure.

An example: if new field nodes need to be added right after the field nodes to be deleted (and not after some other field nodes further down), one way is to first add them after those nodes, while we can still identify those places, and only then delete them.

# Add new ones, after last 'field' that has \00\00\00 text in it
# (in reverse, since each one is placed right after that same node)
foreach my $content (reverse "E".."F") {
    my $tag = $dom->new_tag('field', $content);
    $dom->find('field')->grep(sub { m{\\00\\00\\00} })->last->append($tag);
}

# Only now remove those 'field' nodes with \00\00\00
$dom->find("field")->each( 
    sub { $_[0]->remove if $_[0] =~ m{\\00\\00\\00} } );

With this library it is also easy to replace the content of a node if that is desired (rather than add-and-remove).
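
A rough sketch of such a replacement, using the `content` method of Mojo::DOM. The inline sample and the replacement list are assumptions for the demo; a node is left untouched once the list runs out.

```perl
use warnings;
use strict;
use feature 'say';

use Mojo::DOM;

# Inline sample modelled on the question (assumed)
my $xml = <<'XML';
<root>
 <array>
  <field> a </field>
  <field> \00\00\00 </field>
  <field> \00\00\00 </field>
 </array>
</root>
XML

# Parse in XML mode explicitly
my $dom = Mojo::DOM->new->xml(1)->parse($xml);

my @replacements = ('d', 'e');   # hypothetical replacement values

# Replace the content of each matching node with the next value
$dom->find('field')
    ->grep( sub { $_->text =~ m{\\00\\00\\00} } )
    ->each( sub { $_->content( shift @replacements ) if @replacements } );

say $dom;
```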

zdim
  • Hi, this fix doesn't work on the original file due to the fact that there are invalid characters and it doesn't load the xml file properly. Also, when used on the file in which I have removed the offending characters it adds the new elements in the wrong place as there are more elements in this file. – Orange Pukeko Apr 07 '22 at 08:06
  • @OrangePukeko I can only use what you show in the example, so this demo of course can't work with a different file -- including those "_more elements_". (I am not even sure what you mean by that.) It works on the XML you show, what I tested it with. Please adjust the code to your actual example? – zdim Apr 07 '22 at 09:05
  • @OrangePukeko (Or show a more representative example of the input file? Can add it to the end of the question. Then I can add a note for how to change the code for that, or just change it if that makes more sense.) – zdim Apr 07 '22 at 09:13
  • @OrangePukeko I've added another example, with a different library. Then I've added an example of what I think you may have meant with "_more elements_". Please clarify. – zdim Apr 07 '22 at 20:11
  • My apologies, I tried to keep the example minimal. I have updated the question. Hopefully that clarifies it, I have a large file of >100000 lines outputted by a program that I can't alter. I need to fix about five lines in that and so far I have been doing that by replacing them in an editor. I would like to automate that. Your proposed solution of replacing the nodes' content seems to be what I am looking for and I will try that now. I apologise for not asking a more direct question, as I am quite unfamiliar with both xml and perl. – Orange Pukeko Apr 08 '22 at 01:42
  • @OrangePukeko OK, I see, thank you! Questions: 1) So you want to _replace_ all `\00\00\00` in `field` nodes by other things (say, `d`,`e`,...) -- is that correct? Not to remove them and add (whatever number of) other lines, but to specifically replace those patterns in those nodes? 2) Add those `dimension` nodes ... where? in `path1` only? Or everywhere for `dim="1"...`? – zdim Apr 08 '22 at 03:00