Can you help me write a regexp that is correct from the point of view of sed syntax? So far, every regexp I write is rejected by the terminal as invalid.
- `sed` can't determine uniqueness. You can use a regexp to extract the URLs from the logs, then pipe to `sort -u` to get the unique values. – Barmar Feb 18 '20 at 13:41
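(A minimal sketch of that idea on a toy input, not the real log format: extract with a sed regexp, then dedupe with `sort -u`.)

    # pull an http:// token out of each line, then keep unique values
    printf 'x http://a.com y\nz http://a.com w\n' \
      | sed 's/.*\(http:\/\/[^ ]*\).*/\1/' | sort -u
    # prints: http://a.com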
- See https://stackoverflow.com/questions/161738/what-is-the-best-regular-expression-to-check-if-a-string-is-a-valid-url and https://stackoverflow.com/questions/3809401/what-is-a-good-regular-expression-to-match-a-url – Barmar Feb 18 '20 at 13:42
- The problem is not getting unique values; the problem is writing a regexp that is valid in sed syntax. – SpaceBucks Feb 18 '20 at 13:44
- Better to pipe the result to `uniq` if you do not want to lose the original sorting. – Francesco Gasparetto Feb 18 '20 at 13:45
- Please give an example of log lines, so that we can help you with the regexp. Anyway, I would use grep and not sed. – Francesco Gasparetto Feb 18 '20 at 13:46
- @franzisk `uniq` only works if the lines are sorted. – Barmar Feb 18 '20 at 13:47
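(A quick illustration of the difference:)

    # uniq only collapses adjacent duplicates; sort -u also handles scattered ones
    printf 'b\na\nb\n' | uniq      # prints: b a b  (nothing collapsed)
    printf 'b\na\nb\n' | sort -u   # prints: a b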
- Example of a log record; I need to extract the URL after the HTTP response code: `41.201.181.27 - [2019-04-06 18:22:02] "GET /images/stands/photo_mois.jpg HTTP/1.1" 304 - "http://example.com/popup.php?choix=mois" "Mozilla/4.0" "-"` – SpaceBucks Feb 18 '20 at 13:48
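(For a line of exactly that shape, a hedged BRE sketch that captures the quoted URL right after the status code; `access.log` is a placeholder file name, and the pattern assumes the status code is the first bare number after the request's closing quote:)

    # -n plus the p flag: print only lines where the substitution matched
    sed -n 's/.*" [0-9][0-9]* [^ ]* "\([^"]*\)".*/\1/p' access.log | sort -u
    # on the sample line this prints: http://example.com/popup.php?choix=mois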
- Show the regexp you tried. Remember that `sed` defaults to basic RE (BRE); it doesn't do PCRE. – Barmar Feb 18 '20 at 13:48
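(This is often the whole problem: `+`, `?`, and `|` are ERE/PCRE features, and in basic RE they are literal characters, so a PCRE-style pattern fails or silently mismatches. A small illustration, assuming a sed that supports the `-E` flag, as GNU and BSD sed do:)

    printf 'aaa\n' | sed 's/a+/X/'      # BRE: '+' is a literal plus, no match, prints aaa
    printf 'aaa\n' | sed -E 's/a+/X/'   # ERE via -E: prints X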
- Access logs don't contain full URLs; they leave out the `http://domain` prefix. – Barmar Feb 18 '20 at 13:49
- There are many tools for extracting data from webserver access logs; you shouldn't need to use your own regexp. – Barmar Feb 18 '20 at 13:49
- @Barmar you are right, I was wrong. – Francesco Gasparetto Feb 18 '20 at 13:50
- So how can I extract unique URLs from the log with grep, without writing my own regexp? – SpaceBucks Feb 18 '20 at 13:54
- I tried an answer below. You don't even need grep. – Francesco Gasparetto Feb 18 '20 at 13:56
1 Answer
If your log syntax is uniform, use this command:

    cut -f4 -d\" < logfile | sort -u

If you want to exclude the query string from the uniqueness check, use this:

    cut -f4 -d\" < logfile | cut -f1 -d\? | sort -u
Explanation
Filter the output with the `cut` command, taking the 4th field (`-f4`) using `"` as the separator (`-d\"`). The second filter works the same way, using `?` as the separator.
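(A worked check, feeding the sample line from the comments straight through the pipeline:)

    printf '%s\n' '41.201.181.27 - [2019-04-06 18:22:02] "GET /images/stands/photo_mois.jpg HTTP/1.1" 304 - "http://example.com/popup.php?choix=mois" "Mozilla/4.0" "-"' \
      | cut -f4 -d\" | cut -f1 -d\?
    # prints: http://example.com/popup.php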

Francesco Gasparetto
- [The `cat`s are useless.](/questions/11710552/useless-use-of-cat) You are making some fairly bold assumptions about what the log format looks like; I'm guessing you assume they will be processing Apache logs. – tripleee Feb 18 '20 at 13:59
- No, I have read all of the user's comments, where he put a log line example. Thanks for the `cat` suggestion, I'll fix it. – Francesco Gasparetto Feb 18 '20 at 14:00
- Can I somehow transform the command above to find only unique POST records? – SpaceBucks Feb 19 '20 at 10:15
- Just add a `grep POST` stage, but before the cut, because the request method is in the second `"`-delimited field and `cut -f4` discards it: it would become `grep '"POST ' logfile | cut -f4 -d\" | sort -u`. Learn grep, it is very useful; it is just a filter for text parsing, very easy to use. – Francesco Gasparetto Feb 19 '20 at 13:34
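(A quick check with a hypothetical POST log line, showing why the grep has to run before the cut:)

    printf '%s\n' '1.2.3.4 - [2019-04-06 18:22:02] "POST /login.php HTTP/1.1" 200 - "http://example.com/index.php" "Mozilla/4.0" "-"' \
      | grep '"POST ' | cut -f4 -d\" | sort -u
    # prints: http://example.com/index.php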