
I've been reading about deleting duplicate lines all over Stack Overflow. There are Perl, awk, and sed solutions, but none as specific as what I need, and I'm at a loss.

I want to delete the duplicate <upath> tags from this XML, case INSENSITIVELY, with a quick bash/shell/Perl command. Leave all other duplicate lines (like <start> and <end>) intact!

Input XML:

  <package>
    <id>1523456789</id>
    <models>
      <model type="A">
        <start>2016-04-20</start>      <------ Duplicate line to keep 
        <end>2017-04-20</end>          <------ Duplicate line to keep
      </model>
      <model type="B">                 
        <start>2016-04-20</start>      <------ Duplicate line to keep
        <end>2017-04-20</end>          <------ Duplicate line to keep
      </model>
    </models>
    <userinterface>
      <upath>/Example/Dir/Here</upath>
      <upath>/Example/Dir/Here2</upath>
      <upath>/example/dir/here</upath>   <------ Duplicate line to REMOVE
    </userinterface>
  </package>

So far I've been able to grab the duplicate lines, but I don't know how to remove them. The following

grep -H path *.[Xx][Mm][Ll] | sort | uniq -id

Gives the result:

test.xml:          <upath>/example/dir/here</upath>

How do I remove that line now?

Running either the Perl or the awk version below erases the duplicate <start> and <end> dates as well.

perl -i.bak -ne 'print unless $seen{lc($_)}++' test.xml
awk '!a[tolower($0)]++' test.xml > test.xml.new
Benjamin W.
dlite922
  • Aside: `[Xx][Mm][Ll]` is silly. Why not just consistently use lowercase `.xml`? – John Kugelman Apr 20 '16 at 21:30
  • In what way is `...` a duplicate of `...`? Your final awk solution should work just fine. – William Pursell Apr 20 '16 at 21:33
  • @John, there are some XML files that are upper-case. – dlite922 Apr 20 '16 at 21:36
  • @WilliamPursell notice there are TWO start and TWO end tags but for different model types: A and B. – dlite922 Apr 20 '16 at 21:37
  • @WilliamPursell `2016-04-20` and `2017-04-20` each appear twice in the file. – ThisSuitIsBlackNot Apr 20 '16 at 21:38
  • You need to define the problem more clearly, and the solution will be clear. What is it that distinguishes a duplicate that should be deleted from one that should not be? Is it that the duplicates which should be deleted occur in the same object? Is it that they contain valid path names? Is it that they do not contain date-like strings? – William Pursell Apr 20 '16 at 21:54
  • I want to delete duplicate tags, case insensitively. – dlite922 Apr 20 '16 at 22:01
  • *"I want to delete the duplicate tags from this XML case INSENSITIVELY"* You mean you want some code that will just delete those damn lines no matter *how much* they meant to me or *whose funeral* I was on my way to? – Borodin Apr 20 '16 at 22:27

5 Answers


The following script accepts an XML file as its first argument, uses xmlstarlet (`xml` in the script) to parse the XML tree, and uses an associative array (requires Bash 4) to store unique <upath> node values.

#!/bin/bash

input_file=$1
# XPath to retrieve <upath> node value.
xpath_upath_value='//package/userinterface/upath/text()'
# XPath to print XML tree excluding  <userinterface> part.
xpath_exclude_userinterface_tree='//package/*[not(self::userinterface)]'
# Associative array to help us remove duplicated <upath> node values.
declare -A arr

print_userinterface_no_dup() { 
    printf '%s\n' "<userinterface>"
    printf '<upath>%s</upath>\n' "${arr[@]}"
    printf '%s\n' "</userinterface>"
}

# Iterate over each <upath> node value, lower-case it and use it as a key in the associative array.
while read -r upath; do
    key="${upath,,}"
    # We can remove this 'if' statement and simply arr[$key]="$upath"
    # if it doesn't matter whether we remove <upath>foo</upath> or <upath>FOO</upath>
    if [[ ! "${arr[$key]}" ]]; then
        arr[$key]="$upath"
    fi
done < <(xml sel -t -m "$xpath_upath_value" -c \. -n "$input_file")

printf '%s\n' "<package>"

# Print XML tree excluding <userinterface> part.
xml sel -t -m "$xpath_exclude_userinterface_tree" -c \. "$input_file"

# Print <userinterface> tree without duplicates.
print_userinterface_no_dup

printf '%s\n' "</package>"

Test (script name is sof):

$ ./sof xml_file
<package>
    <id>1523456789</id>
    <models>
      <model type="A">
        <start>2016-04-20</start>
        <end>2017-04-20</end>
      </model>
      <model type="B">                 
        <start>2016-04-20</start>
        <end>2017-04-20</end>
      </model>
    </models>
    <userinterface>
        <upath>/Example/Dir/Here2</upath>
        <upath>/Example/Dir/Here</upath>
    </userinterface>
</package>

If my comments are not making the code clear enough for you, please ask and I'll answer and edit this solution accordingly.


My xmlstarlet version is 1.6.1, compiled against libxml2 2.9.2 and libxslt 1.1.28.

Rany Albeg Wein

If you're parsing XML, you really should use a parser. There are multiple options for this - but DON'T use regular expressions, because they're a route to really brittle code - for all the reasons you're finding.

See: parsing XML with regex.

But the long and short of it is: XML is a contextual language, and regular expressions aren't. There are also perfectly valid variations in XML that are semantically identical but that a regex won't handle.

E.g. unary (self-closing) tags, variable indentation, the same tag at different depths, and line wrapping.

I could format your source XML a bunch of different ways, all of which would be valid XML saying the same thing, but which would break regex-based parsing. That's something to avoid: one day, mysteriously, your script will break for no particular reason, as the result of an upstream change that's valid within the XML spec.
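To make that concrete, here's a sketch (hypothetical re-wrapped input, filtered through the awk one-liner from elsewhere in this thread) of how a line-oriented tool stops seeing the duplicate as soon as the markup moves onto a shared line:

```shell
# Valid XML, same meaning as the original, but <userinterface> now
# shares a line with the first <upath>.
printf '%s\n' \
  '<userinterface><upath>/Example/Dir/Here</upath>' \
  '<upath>/example/dir/here</upath></userinterface>' |
awk '!(/<upath>/ && seen[tolower($1)]++)'
# Both lines survive: the line-based "key" now includes the
# surrounding tags, so the duplicate path is no longer detected.
```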

Which is why you should use a parser:

I like XML::Twig, which is a Perl module. You can do what you want with something like this:

#!/usr/bin/env perl
use strict;
use warnings;

use XML::Twig; 

my %seen; 

#a subroutine to process any "upath" tags. 
sub process_upath {
   my ( $twig, $upath ) = @_; 
   my $text = lc $upath -> trimmed_text;
   $upath -> delete if $seen{$text}++; 
}

#instantiate the parser, and configure what to 'handle'. 
my $twig = XML::Twig -> new ( twig_handlers => { 'upath' => \&process_upath } );
   #parse from our data block - but you'd probably use a file handle here. 
   $twig -> parse ( \*DATA );
   #set output formatting
   $twig -> set_pretty_print ( 'indented_a' );
   #print to STDOUT.
   $twig -> print;

__DATA__
  <package>
    <id>1523456789</id>
    <models>
      <model type="A">
        <start>2016-04-20</start>   
        <end>2017-04-20</end>    
      </model>
      <model type="B">                 
        <start>2016-04-20</start>     
        <end>2017-04-20</end>        
      </model>
    </models>
    <userinterface>
      <upath>/Example/Dir/Here</upath>
      <upath>/Example/Dir/Here2</upath>
      <upath>/example/dir/here</upath>   
    </userinterface>
  </package>

This is the long form, to illustrate the concept, and it outputs:

<package>
  <id>1523456789</id>
  <models>
    <model type="A">
      <start>2016-04-20</start>
      <end>2017-04-20</end>
    </model>
    <model type="B">
      <start>2016-04-20</start>
      <end>2017-04-20</end>
    </model>
  </models>
  <userinterface>
    <upath>/Example/Dir/Here</upath>
    <upath>/Example/Dir/Here2</upath>
  </userinterface>
</package>

It can be reduced considerably, though, via the parsefile_inplace method.

Sobrique
  • Thanks, but a bit overkill. Ed Morton fixed the problem in my awk. – dlite922 Apr 21 '16 at 16:36
  • 2
    Awk is a bad idea, because it doesn't do context. XML is a contextual language, so that sort of solution will always be brittle, and prone to breaking with perfectly valid changes to the input. See http://stackoverflow.com/a/1732454/2566198 – Sobrique Apr 21 '16 at 18:57
  • That's ok, I'm just doing this once or twice until I fix a bug in my app. Otherwise I'd be editing the XML's by hand. – dlite922 Apr 22 '16 at 15:59

If you only want to drop duplicate lines that immediately follow each other, you can store the previous line and compare against it. To ignore case, apply tolower() to both sides of the comparison:

awk '{ if (tolower(prev) != tolower($0)) print; prev = $0 }'
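For example (a quick sketch with made-up input), note that only adjacent duplicates are dropped; a repeat that appears later survives:

```shell
# The two '/a' duplicates are NOT adjacent here.
printf '%s\n' \
  '<upath>/A</upath>' \
  '<upath>/a</upath>' \
  '<upath>/B</upath>' \
  '<upath>/a</upath>' |
awk '{ if (tolower(prev) != tolower($0)) print; prev = $0 }'
# Only the adjacent duplicate (line 2) is dropped; the final
# '<upath>/a</upath>' still prints because its predecessor differs.
```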
fejese
  • Unfortunately sometimes these lines are not one after another, there could be a third in the middle of them that's different. I'll update the question with this in mind. – dlite922 Apr 20 '16 at 21:45

It looks like you're working with XML. Would you like to parse it?

Hey, I'd never done it with Perl before, but there's an Introductory Tutorial and everything... which wasn't super straightforward. Reading the XML::SAX::ParserFactory and XML::SAX::Base documentation, I came up with the code you see at the bottom of this answer.

The question was updated so that the duplicate lines are no longer adjacent; previously:

Okay, I'm seeing that you've got two <start> tags with dates that match and two <end> tags with dates that match in the whole file, but those are in different sections. If all your duplicate lines were effectively adjacent, as they were in your original example, you would only need the uniq command from GNU Coreutils or an equivalent. This command could ignore case through the right use of the LC_COLLATE environment variable setting, but honestly, I found it very hard to spot an example of, or documentation on, using LC_COLLATE to ignore case.
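For what it's worth, uniq also has an -i (ignore case) flag in both GNU Coreutils and BSD, so for adjacent duplicates no locale juggling is needed (a sketch with made-up input):

```shell
# The second line is a case-variant duplicate of the first;
# uniq -i keeps the first spelling and drops the repeat.
printf '%s\n' \
  '<upath>/Example/Dir/Here</upath>' \
  '<upath>/example/dir/here</upath>' \
  '<start>2016</start>' |
uniq -i
```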

Continuing with a parser:

#!/usr/bin/perl
use XML::SAX;

my $parser = XML::SAX::ParserFactory->parser(
    Handler => TestXMLDeduplication->new()
);

my $ret_ref = $parser->parse_file(\*TestXMLDeduplication::DATA);
close(TestXMLDeduplication::DATA);

print "\n\nDuplicates skipped: ", $ret_ref->{skipped}, "\n";
print "Duplicates cut: ", $ret_ref->{cut}, "\n";

package TestXMLDeduplication;
use base qw(XML::SAX::Base);

my $inUserinterface;
my $inUpath;
my $upathSeen;
my $defaultOut;
my $currentOut;
my $buffer;
my %seen;
my %ret;

sub new {
    # Ideally STDOUT would be an argument
    my $type = shift;
    #open $defaultOut, '>&', STDOUT or die "Opening STDOUT failed: $!";
    $defaultOut = *STDOUT;
    $currentOut = $defaultOut;
    return bless {}, $type;
}

sub start_document {
    %ret = ();
    $inUserinterface = 0;
    $inUpath = 0;
    $upathSeen = 0;
}

sub end_document {
    return \%ret;
}

sub start_element {
    my ($self, $element) = @_;

    if ('userinterface' eq $element->{Name}) {
      $inUserinterface++;
      %seen = ();
    }
    if ('upath' eq $element->{Name}) {
      $buffer = q{};
      undef $currentOut;
      open($currentOut, '>>', \$buffer) or die "Opening buffer failed: $!";
      $inUpath++;
    }

    print $currentOut '<', $element->{Name};
    print $currentOut attributes($element->{Attributes});
    print $currentOut '>';
}

sub end_element {
    my ($self, $element) = @_;

    print $currentOut '</', $element->{Name};
    print $currentOut '>';

    if ('userinterface' eq $element->{Name}) {
      $inUserinterface--;
    }

    if ('upath' eq $element->{Name}) {
      close($currentOut);
      $currentOut = $defaultOut;
      # Check if what's in upath was seen (lower-cased)
      if ($inUserinterface && $inUpath) {
        if (!exists $seen{lc($buffer)}) {
          print $currentOut $buffer;
        } else {
          $ret{skipped}++;
          $ret{cut} .= $buffer;
        }
        $seen{lc($buffer)} = 1;
      }
      $inUpath--;
    }
}

sub characters {
    # Note that this also captures indentation, newlines between tags, etc.
    my ($self, $characters) = @_;

    print $currentOut $characters->{Data};
}

sub attributes {
    my ($attributesRef) = @_;
    my %attributes = %$attributesRef;

    foreach my $a (values %attributes) {
        my $v = $a->{Value};
        # See also XML::Quote
        $v =~ s/&/&amp;/g;
        $v =~ s/</&lt;/g;
        $v =~ s/>/&gt;/g;
        $v =~ s/"/&quot;/g;
        print $currentOut ' ', $a->{Name}, '="', $v, '"';
    }
}

__DATA__
  <package>
    <id>1523456789</id>
    <models>
      <model type="A">
        <start>2016-04-20</start>   
        <end>2017-04-20</end>    
      </model>
      <model type="B">                 
        <start>2016-04-20</start>     
        <end>2017-04-20</end>        
      </model>
    </models>
    <userinterface>
      <upath>/Example/Dir/Here</upath>
      <upath>/Example/Dir/Here2</upath>
      <upath>/example/dir/here</upath>   
    </userinterface>
    <userinterface>
      <upath>/Example/Dir/<b>Here</b></upath> <upath>/Example/Dir/Here2</upath>
      <upath>/example/dir/<b>here</b></upath>   
    </userinterface>
  </package>

This no longer works line by line; instead it finds upath tags inside userinterface tags and removes them if they're duplicates within that parent group. The surrounding indentation and newlines are retained. Also, it would get kind of weird if there were upath tags nested within upath tags.

It looks like this:

$ perl saxEG.pl
<package>
    <id>1523456789</id>
    <models>
      <model type="A">
        <start>2016-04-20</start>
        <end>2017-04-20</end>
      </model>
      <model type="B">
        <start>2016-04-20</start>
        <end>2017-04-20</end>
      </model>
    </models>
    <userinterface>
      <upath>/Example/Dir/Here</upath>
      <upath>/Example/Dir/Here2</upath>

    </userinterface>
    <userinterface>
      <upath>/Example/Dir/<b>Here</b></upath> <upath>/Example/Dir/Here2</upath>

    </userinterface>
  </package>
Duplicates skipped: 2
Duplicates cut: <upath>/example/dir/here</upath><upath>/example/dir/<b>here</b></upath>
dlamblin
$ awk '!(/<upath>/ && seen[tolower($1)]++)' file
  <package>
    <id>1523456789</id>
    <models>
      <model type="A">
        <start>2016-04-20</start>      <------ Duplicate line to keep
        <end>2017-04-20</end>          <------ Duplicate line to keep
      </model>
      <model type="B">
        <start>2016-04-20</start>      <------ Duplicate line to keep
        <end>2017-04-20</end>          <------ Duplicate line to keep
      </model>
    </models>
    <userinterface>
      <upath>/Example/Dir/Here</upath>
      <upath>/Example/Dir/Here2</upath>
    </userinterface>
  </package>
Ed Morton
  • 1
    hahaha, Drive-By Downvoter... chuckle. I don't know who it was but I got you buddy! It's the answer i'm looking for. Yeah I could write a program for it but in my situation it's a patch until the next release of my software. I had to edit these XML's by hand each day as they came in just to get them through my application. I needed a one-liner to put in a cron until I found the bug and fixed it. Yeah, sure I could use XMLStarlet and Perl XML to /program/ it, but why write a program when a one liner would do? ! Thank you! – dlite922 Apr 21 '16 at 16:35
  • Bonus points if you can do this in place? I need to apply this to a directory of XMLs without using a for loop and temp files like so: `for afile in *.xml; do awk '...' "$afile" > "$afile.tmp" && mv "$afile.tmp" "$afile"; done`. It seems like a guru would come up with a better method that I'm not seeing to awk the file in place. – dlite922 Apr 21 '16 at 16:40
  • With GNU awk just add the `-i inplace` flag. Otherwise the tmp file is the way to do it. Also consider using `find ... -print0 | xargs -0` instead of a for loop. – Ed Morton Apr 21 '16 at 16:44
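A sketch of both suggestions from the comment above (the awk pattern is from this answer; the filenames are hypothetical). One caveat worth noting: with multiple files, the seen array carries over from one file to the next unless you reset it, e.g. on FNR==1:

```shell
# GNU awk 4.1+ only: true in-place editing across all XML files,
# resetting the dedup state at the start of each file.
gawk -i inplace 'FNR==1{delete seen} !(/<upath>/ && seen[tolower($1)]++)' *.xml

# Portable fallback: one temp file per file, driven by find/xargs
# instead of a shell for loop.
find . -maxdepth 1 -name '*.xml' -print0 |
  xargs -0 -n1 sh -c \
    'awk "!(/<upath>/ && seen[tolower(\$1)]++)" "$1" > "$1.tmp" && mv "$1.tmp" "$1"' _
```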