
I need to get the unique URLs from a web log and then sort them. I was thinking of using the grep, uniq, and sort commands and writing the output to another file.

I executed this command:

cat access.log | awk '{print $7}' > url.txt

Then I get only the unique ones and sort them:

cat url.txt | uniq | sort > urls.txt

The problem is that I can still see duplicates, even though the file is sorted, which suggests the command ran. Why?

Ryan Berger
aki

4 Answers


uniq | sort does not work: uniq only removes adjacent duplicates, so its input has to be sorted first.

The correct way is sort | uniq, or better sort -u, since it spawns only one process.
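
Applied to the files from the question, a minimal sketch (assuming url.txt already contains one URL per line, as produced by the awk step):

sort -u url.txt > urls.txt          # one process: sort and de-duplicate
sort url.txt | uniq > urls.txt      # same result with two processes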

Giacomo1968
mouviciel

uniq needs its input sorted, but you sorted after uniq. Try:

$ sort -u < url.txt > urls.txt
William Pursell

Try something like this:

cat url.txt | sort | uniq
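
As an aside (not asked for in the question), if you also want to see how often each URL occurs, a common variant is to count with uniq -c and then sort numerically:

sort url.txt | uniq -c | sort -rn   # most frequent URLs first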
Giacomo1968
Lewis Norton

For nginx access logs, this gives the unique URLs being requested:

 sed -r "s/.*(GET|POST|PUT|DELETE|HEAD) (.*) HTTP.*/\2/" /var/log/nginx/access.log | sort | uniq

Reference: https://www.guyrutenberg.com/2008/08/10/generating-url-list-from-access-log-access_log/
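
As a quick sanity check (the log line below is made up, assuming the default combined log format), the sed expression reduces a line such as

 203.0.113.7 - - [10/Aug/2008:12:00:00 +0000] "GET /index.html HTTP/1.1" 200 512 "-" "Mozilla/5.0"

to just /index.html before the pipeline sorts and de-duplicates the paths.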

Pankaj Garg