Counting unique lines in a text file
I have an alias for this kind of thing since I run into it so often:
alias cnt='sort -if |uniq -ic |sort -ifn' # case insensitive
alias CNT='sort |uniq -c |sort -n' # strict, case sensitive
This sorts the input (-i ignores nonprinting characters, -f ignores case), then uses uniq to do the counting (uniq can only handle pre-sorted data; -i makes it case insensitive, -c prefixes each line with its repetition count), then sorts those counts numerically (-n for numeric). (Note: the line cnt prints for a group may be more capitalized than you expect, because of how sort and uniq resolve case differences.)
Invoke this like:
cat 20150229.log |cnt
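For example, with the alias loaded in an interactive shell, here is a tiny made-up sample (not from your log, just to show the shape of the output):
printf 'GET /home\nget /home\nGET /about\nGET /home\n' |cnt
That prints a count of 1 for GET /about and a count of 3 for GET /home: the three /home lines collapse into a single entry despite the case difference, and the smallest counts come out first.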
Arguments to cnt will be passed to the final sort command, so you can use flags like -r to reverse the sorting. I recommend piping the result through tail, or through something like awk '$1 > 5', to eliminate all of the small entries.
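A couple of concrete variations on the same log file from above:
cat 20150229.log |cnt -r |head        # biggest counts first, top ten entries
cat 20150229.log |cnt |awk '$1 > 5'   # only entries seen more than five times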
Parsing XML
The above works great for random text files like logs. Parsing HTML or XML is a Bad Idea™ unless you have full knowledge of the exact formatting you'll be parsing.
That said, you have a grep query with a flawed regular expression to match the XML:
grep '<Account Id="*">'
This matches <Account Id=""> (as well as <Account Id="> and <Account Id=""">, which you may not want), but it won't match your example <Account Id="123456789012">. The * in that regex looks for zero or more of the previous character (the "). Here is a more thorough explanation.
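You can see the problem by feeding grep a couple of throwaway test lines (these echo commands are just hand-made checks, not part of your log):
echo '<Account Id="123456789012">' |grep '<Account Id="*">'   # no match, prints nothing
echo '<Account Id="">' |grep '<Account Id="*">'               # matches, prints the line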
You need a . in there to represent any character (explanation here):
grep '<Account Id=".*">'
Additionally, grep doesn't require the pattern to match the full line unless you give it the -x flag, and I'm guessing you don't want that because it would then fail if there is surrounding whitespace (see the above Bad Idea™ link!). Here is a cheaper version of that grep, making use of my alias:
grep '<Account Id=' 20150229.log |cnt
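If the matching lines can differ elsewhere and you only care about the Id values themselves, a rough variation (still not real XML parsing, and assuming the attribute always appears in exactly this form) is to trim each match down to the attribute with grep -o before counting:
grep -o '<Account Id="[^"]*"' 20150229.log |cnt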