
I have a huge log that contains hundreds of thousands of lines of XML transactions.

Many lines contain duplicate entries, e.g. Account Ids. I would like to grep/sed/awk those Account Ids, sort them, and show the unique results or count them.

Below is the pattern I am trying to grep/sed/awk:

<Account Id="123456789012">

So far I've tried the following:

sort 20150229.log | grep '<Account Id="*">' | uniq | wc -l

but I get 0 results.

Please advise.

Thanks

user2301195
    Edit your question to include some concise, testable sample input and expected output so we can help you but at a glance - `*` means `zero or more repetitions of the preceding regexp segment`, you should use `[^"]*` instead of `*`. There are other issues with your approach too though. – Ed Morton Feb 29 '16 at 17:07
  • Use an XML/HTML parser (xmllint, xmlstarlet ...). – Cyrus Feb 29 '16 at 18:14
  • `sort` itself can produce unique lines for you. Try `grep -E '…'` – dawg Feb 29 '16 at 20:21
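
A minimal sketch of the parser route Cyrus suggests, assuming xmlstarlet is installed and the log is (or can be wrapped into) a single well-formed XML document:

xmlstarlet sel -t -m '//Account' -v '@Id' -n 20150229.log | sort | uniq -c

This visits every Account element, prints its Id attribute on its own line, and then counts duplicates the usual way.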

4 Answers


Counting unique lines in a text file

I have an alias for this kind of thing since I run into it so often:

alias cnt='sort -if |uniq -ic |sort -ifn'  # case insensitive
alias CNT='sort |uniq -c |sort -n'         # strict, case sensitive

This sorts the input (-i ignores nonprinting characters, -f ignores case), pipes it through uniq (which only works on pre-sorted data; -i compares case-insensitively, -c prefixes each line with its repetition count), and finally sorts by that count (-n for numeric). (Note: because uniq -i collapses case variants into a single line, the casing cnt reports for a given line may not be the one you expect.)

Invoke this like:

cat 20150229.log |cnt

Arguments to cnt will be passed to the final sort command, so you can use flags like -r to reverse the sorting. I recommend running it through tail or something like awk '$1 > 5' to eliminate all of the small entries.
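
For example, to keep only the IDs seen more than five times (a sketch combining the alias with the grep shown further down):

grep '<Account Id=' 20150229.log | cnt | awk '$1 > 5'

Since cnt's output lines start with the count, awk's $1 test filters on it directly.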

 

Parsing XML

The above works great for random text files like logs. Parsing HTML or XML with regular expressions is a Bad Idea™ unless you have full knowledge of the exact formatting you'll be parsing.

That said, you have a grep query with a flawed regular expression to match XML:

grep '<Account Id="*">'

This matches <Account Id=""> (as well as <Account Id="> and <Account Id=""">, which you may not want), but it won't match your example <Account Id="123456789012">. The * in that regex matches zero or more of the previous character (the ").
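
You can verify this at the shell with made-up input:

$ printf '<Account Id="">\n<Account Id="123456789012">\n' | grep '<Account Id="*">'
<Account Id="">

Only the empty-ID line matches.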

You need a . in there to represent any character:

grep '<Account Id=".*">'

Additionally, grep doesn't require a full-line match unless you give it the -x flag, and I'm guessing you don't want that anyway, since it would then fail on surrounding whitespace (see the Bad Idea™ note above!). Here is a cheaper version of that grep, making use of my alias:

grep '<Account Id=' 20150229.log |cnt
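
If you'd rather count the IDs themselves than whole lines (so surrounding whitespace or extra attributes can't split the tallies), grep -o can isolate just the matching part first (a sketch, assuming your grep supports -o):

grep -o 'Account Id="[^"]*"' 20150229.log | cnt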
Adam Katz

It's quite easy to use a parser. I like XML::Twig for this sort of job, because you can purge as you go.

But something like:

#!/usr/bin/env perl
use strict;
use warnings;
use XML::Twig;

my %count_of;

sub count_unique_id {
    my ( $twig, $account ) = @_;
    my $id = $account->att('Id');   # attribute names are case-sensitive: the tag has Id, not id
    print "New ID: $id\n" unless $count_of{$id};
    $count_of{$id}++;
    $twig -> purge; 
}

my $twig = XML::Twig -> new ( twig_handlers => { 'Account' => \&count_unique_id } );
$twig -> parsefile ( 'your_file.xml'); 

foreach my $id ( keys %count_of ) { 
   print "$id => $count_of{$id}\n";
}

print "There were ", scalar keys %count_of, " unique IDs\n"; 
Sobrique

If you're confident about the regularity of the XML and do not feel the need to use an XML-aware tool, then the following may well suffice, and has some advantages, e.g. it does not require gawk while still being somewhat tolerant of small variations:

awk -v RS='<' '/^Account +Id *=/ { sub(/^[^=]*= *"/,""); sub(/".*/, ""); print }' 20150229.log |
sort | uniq

If you want to avoid the sort, then you can easily modify the awk script, e.g. as follows:

awk -v RS='<' '
 /^Account +Id *=/ { sub(/^[^=]*= *"/,""); sub(/".*/, ""); m[$0] }
 END {for (i in m) {print i}}' 20150229.log
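
To see what the RS='<' trick does, feed it a tiny made-up sample and print each non-empty record (this works in any POSIX awk, since RS is a single character):

$ printf '<Foo><Account Id="123">' | awk -v RS='<' 'NF {print NR ": " $0}'
2: Foo>
3: Account Id="123">

Every < opens a new record, so /^Account +Id *=/ can anchor at the start of a record no matter how the lines are laid out.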
peak

You haven't shown us any testable sample input and expected output so it's a guess but this MAY be what you want:

awk 'sub(/.*<Account Id="/,"") && sub(/".*/,"") && !seen[$0]++' 20150229.log
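
If you also want a count per ID rather than just the unique list, a variation on the same substitutions (a sketch):

awk 'sub(/.*<Account Id="/,"") && sub(/".*/,"") {cnt[$0]++} END {for (id in cnt) print cnt[id], id}' 20150229.log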
Ed Morton