How to extract text from access log?

Question

I am very new to this. I am trying to extract some text from my access log into a new file.
My log file is like this:

111.111.111.111 - - [02/Jul/2021:18:35:19 +0000] "GET /api/items HTTP/2.0" 304 0 "https://example.com/some/text/call-log?roomNo=5003" "Mozilla etc etc etc etc"
111.111.111.111 - - [02/Jul/2021:20:35:19 +0000] "GET /api/items HTTP/2.0" 304 0 "https://example.com/some/text/resevation-log?roomNo=4003" "Mozilla etc etc etc etc"

I want to extract it into a new file in the format below:

02/Jul/2021:18:35:19 +0000, call-log, 5003
02/Jul/2021:20:35:19 +0000, resevation-log, 4003

So far I have managed to write this basic awk command:

awk '{print $4,$5,",",$11}' < /file.log

Which gives me the below output:

[02/Jul/2021:18:35:19 +0000] , "https://example.com/some/text/call-log?roomNo=5003"

Most of the time, it can be done with a regex and a substitution. For that use `sed`, see https://www.grymoire.com/Unix/Sed.html#uh-1 . You can have fun learning regex at https://regexcrossword.com/ — KamilCuk, Jul 04 '21 at 16:30
@oguzismail removing the awk and sed tags from the question was an odd thing to do when those are the mandatory POSIX text processing tools. I for one would never see a question tagged "unix-text-processing" and I suspect that's probably true of many other awk and/or sed experts looking to help people on this forum. — Ed Morton, Jul 04 '21 at 17:37
Since the discussion wasn't tagged with anything I look for (e.g. awk and sed) I had no idea it existed til now. It was tagged with the useless "discussion", "tags", and "tag-tips" tags. I wonder how many questions we're all missing seeing now that sed, awk, etc. tags are being stripped from the questions. I have no plans to run around adding sed and awk tags back into questions that someone else removed them from. Hopefully common sense will prevail over time and if not - whatever... — Ed Morton, Jul 04 '21 at 17:46
@Ed Not many yet, and probably none in the future if no one starts watching this new tag. — oguz ismail, Jul 04 '21 at 17:52
@oguzismail from reading that MSO discussion I think the idea was to **add** a text processing tag, not replace the sed, awk, etc. tags with it. They refer to SE as an example of a forum that has such a tag and that forum still uses sed, awk, etc. too. — Ed Morton, Jul 04 '21 at 18:03
@EdMorton All OK to me as long as questions matching the description there don't have the [bash] tag, I'm tired of removing it all the time. — oguz ismail, Jul 04 '21 at 18:13
@oguzismail it's absolutely bizarre - we have a question on meta that isn't tagged with any of the impacted tags, with only 11 people voting on the question, then an accepted answer that only 12 people voted on with just one possible suggestion in it (ignoring other answers with more votes), and suddenly we have sweeping changes across SO driven by and approved by almost no-one who's actually impacted by this! I don't understand this at all. — Ed Morton, Jul 04 '21 at 18:29
Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/234503/discussion-between-oguz-ismail-and-ed-morton). — oguz ismail, Jul 04 '21 at 18:30

Ed Morton · Accepted Answer · 2021-07-04T17:04:07.033

$ cat tst.awk
BEGIN {
    FS="[[:space:]]*[][\"][[:space:]]*"
    OFS = ", "
}
{
    n = split($6,f,"[/?=]")
    print $2, f[n-2], f[n]
}

$ awk -f tst.awk file
02/Jul/2021:18:35:19 +0000, call-log, 5003
02/Jul/2021:20:35:19 +0000, resevation-log, 4003

The above relies on the following field splitting, which works in any POSIX awk. This debugging script prints every field of the input from your question:

$ cat tst.awk
BEGIN {
    FS="[[:space:]]*[][\"][[:space:]]*"
    OFS = ","
}
{
    print
    for (i=1; i<=NF; i++) {
        print "\t" i, "<" $i ">"
    }
    print "-----"
}

$ awk -f tst.awk file
111.111.111.111 - - [02/Jul/2021:18:35:19 +0000] "GET /api/items HTTP/2.0" 304 0 "https://example.com/some/text/call-log?roomNo=5003" "Mozilla etc etc etc etc"
        1,<111.111.111.111 - ->
        2,<02/Jul/2021:18:35:19 +0000>
        3,<>
        4,<GET /api/items HTTP/2.0>
        5,<304 0>
        6,<https://example.com/some/text/call-log?roomNo=5003>
        7,<>
        8,<Mozilla etc etc etc etc>
        9,<>
-----
111.111.111.111 - - [02/Jul/2021:20:35:19 +0000] "GET /api/items HTTP/2.0" 304 0 "https://example.com/some/text/resevation-log?roomNo=4003" "Mozilla etc etc etc etc"
        1,<111.111.111.111 - ->
        2,<02/Jul/2021:20:35:19 +0000>
        3,<>
        4,<GET /api/items HTTP/2.0>
        5,<304 0>
        6,<https://example.com/some/text/resevation-log?roomNo=4003>
        7,<>
        8,<Mozilla etc etc etc etc>
        9,<>
-----

That would fail if any of your quoted fields can contain `[`, `]`, or an escaped `"`. None of those appear in your example, but if they can occur, include them in the example in your question.
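
If brackets can show up inside the quoted fields (for example in a query string), one possible workaround is sketched below. It assumes the timestamp is the first bracketed item on the line and the referer is the second double-quoted field, and it still breaks on escaped quotes. The function name `extract_referer_fields` is made up for illustration.

```shell
# Hypothetical helper: prints "timestamp, page, roomNo" for each log line.
extract_referer_fields() {
    awk '
    {
        # Timestamp: text between the first "[" and the first "]".
        start = index($0, "[")
        len   = index($0, "]") - start - 1
        if (start == 0 || len < 1) next      # skip lines without a timestamp
        ts = substr($0, start + 1, len)

        # Splitting on double quotes puts the referer URL in q[4]:
        # q[1]=prefix, q[2]=request, q[3]=status and size, q[4]=referer.
        split($0, q, "\"")

        # Same idea as the script above: split the URL on "/", "?" and "=".
        n = split(q[4], f, "[/?=]")
        print ts ", " f[n-2] ", " f[n]
    }' "$@"
}
```

It can be called as `extract_referer_fields file`, or fed from a pipe.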

hey @EdMorton, can you give me a brief explanation about what is all with the `[]`s and `*`s in line `FS`? or just share a reference link. — Ahmet Said Akbulut, Jul 04 '21 at 21:18
Very interesting and useful: using a `for` loop here to show every field produced by the separator. It helps here and in other contexts, I think. — Carlos Pascual, Jul 05 '21 at 08:46
Ahmet, FS is a regexp, so I'm just defining a regexp that's a bracket expression of `[][\"]` (i.e. any of the chars `]`, `[`, or `"`, which are the ones in your text surrounding the strings you're interested in), optionally with spaces around them. Just look up FS in the awk manual for more info. — Ed Morton, Jul 05 '21 at 13:36
@CarlosPascual yeah, often just printing the contents of the fields like that is my first step in debugging any script. — Ed Morton, Jul 05 '21 at 13:39

Carlos Pascual · Answer 2 · 2021-07-04T17:46:07.560

This awk can extract the text:

awk -v FS='[][/?="]' -v OFS=',' '{print $2"/"$3"/"$4,$16,$18}' file
02/Jul/2021:18:35:19 +0000,call-log,5003
02/Jul/2021:20:35:19 +0000,resevation-log,4003
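
The output above has no space after each comma, while the question asks for `, ` as the separator. A small variation, offered only as a sketch, sets `OFS` to `', '`. The wrapper name `extract_with_spaces` is made up for illustration, and the field numbers 16 and 18 hold only when the request path and referer URL contain the same number of slashes as in the sample lines.

```shell
# Hypothetical wrapper around the one-liner above, with ", " as output separator.
extract_with_spaces() {
    # FS splits on any of ] [ / ? = " so the date is spread over $2..$4,
    # the page name lands in $16 and the room number in $18 (sample layout only).
    awk -v FS='[][/?="]' -v OFS=', ' '{print $2"/"$3"/"$4, $16, $18}' "$@"
}
```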

edited Jul 04 '21 at 17:46

answered Jul 04 '21 at 17:32

ahmedazhar05 · Answer 3 · 2022-07-29T20:14:34.910

Another way of doing this using AWK is:

awk '{split($11, A, /\/+|"|(\?roomNo=)/); print substr($4, 2), substr($5, 1, 5) ",", A[6] ",", A[7]}' file.log >> newFile.log

The first part splits the URL field into an array using a regex,
then prints the specific fields and array values.
Finally, `>>` appends the result to another file named newFile.log (use `>` instead to overwrite it).

Edit:
Another, shorter and faster one-liner based on the log output above uses sed (preferred):

sed -E 's/\].+\/|\?roomNo=/, /g; s/^.+\[|".+$//g' file.log >> newFile.log

Here the first substitution replaces `] "GET /api/items HTTP/2.0" 304 0 "https://example.com/some/text/` and `?roomNo=` with `, `, and the second substitution removes the first and last parts, which are `111.111.111.111 - - [` and `" "Mozilla etc etc etc etc"`. Because `.+` is greedy, this relies on the user agent containing no `/`, as in the sample.
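
A variation that captures the three pieces explicitly, instead of deleting the text around them, might look like the sketch below. It assumes `sed -E` is available (GNU and BSD sed both support it) and that the referer always ends in `?roomNo=<value>`. The function name `extract_rooms` is made up for illustration. Lines that do not match are skipped, and a `/` in the user agent does no harm.

```shell
# Hypothetical helper: prints "timestamp, page, roomNo" for matching lines only.
extract_rooms() {
    # \1 = text inside the first [...]          (timestamp)
    # \2 = last path segment before ?roomNo=    (page name)
    # \3 = value of roomNo up to the next " or &
    # "#" is used as the s-command delimiter so "/" needs no escaping.
    sed -E -n 's#^[^[]*\[([^]]+)\].*/([^/?"]+)\?roomNo=([^"&]+)".*#\1, \2, \3#p' "$@"
}
```

For example, `extract_rooms file.log > newFile.log` writes the result to a new file.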
