
I need to get the unique URLs from a web log and then sort them. I was thinking of using the grep, uniq, and sort commands and writing the output to another file.

I executed this command:

cat access.log | awk '{print $7}' > url.txt

Then I get only the unique ones and sort them:

cat url.txt | uniq | sort > urls.txt

The problem is that I can still see duplicates, even though the file is sorted, which suggests the command ran. Why?

Ryan Berger
aki

4 Answers


uniq | sort does not work: uniq only removes adjacent duplicates, so its input has to be sorted first.

The correct way is sort | uniq, or better sort -u, since it spawns only one process.
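
Applied to the files from the question, a minimal sketch (assuming url.txt already contains one URL per line, as produced by the awk step):

sort -u url.txt > urls.txt          # one process: sort and de-duplicate
sort url.txt | uniq > urls.txt      # same result with two processes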

Giacomo1968
mouviciel

uniq needs its input sorted, but you sorted after uniq. Try:

$ sort -u < url.txt > urls.txt
William Pursell

Try something like this:

cat url.txt | sort | uniq
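
As an aside (not asked for in the question), if you also want to see how often each URL occurs, a common variant is to count with uniq -c and then sort numerically:

sort url.txt | uniq -c | sort -rn   # most frequent URLs first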
Giacomo1968
Lewis Norton

For nginx access logs, this gives the unique URLs being requested:

 sed -r "s/.*(GET|POST|PUT|DELETE|HEAD) (.*) HTTP.*/\2/" /var/log/nginx/access.log | sort | uniq

Reference: https://www.guyrutenberg.com/2008/08/10/generating-url-list-from-access-log-access_log/
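
As a quick sanity check (the log line below is made up, assuming the default combined log format), the sed expression reduces a line such as

 203.0.113.7 - - [10/Aug/2008:12:00:00 +0000] "GET /index.html HTTP/1.1" 200 512 "-" "Mozilla/5.0"

to just /index.html before the pipeline sorts and de-duplicates the paths.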

Pankaj Garg