I want to structure my nginx logs, which look like this:

ip - - [18/Dec/2016:06:44:41 +0300] "GET /some/part/thing HTTP/1.1" 200 4320 "https://referrer" "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36"

Currently I read these logs line by line and use grep -E -o to select the details I need (ip, datetime, part, http_code, bandwidth, referrer).

But this is horribly slow given the amount of logs I have. Is it possible to apply a regexp to the entire log file at once (with lookbehind etc.) rather than per line? I was also thinking of doing this on GPUs.

UPDATE:

while IFS= read -r line || [ "$line" ]; do
        cnt=$((cnt+1))
        # we will match each information separating with first space,
        # then remove that info from line for the next info
        ip=
        datetime=
        slug=
        http_code=
        chunk_size=
        referrer=

        # Input 1. ip address
        ip=`echo $line | grep -E -o '^\S*'` || parse_error "ip address" $1 $cnt
        line=${line//$ip}

        # remove trash - -
        trash=`echo $line | grep -E -o '\-\s\-'` || parse_error "- - trash" $1 $cnt
        line=${line//$trash}

        # Input 2. datetime
        datetime=`echo $line | grep -E -o '[0-9]{2}/[A-Za-z]+/[0-9]{4}:[0-9]{2}:[0-9]{2}:[0-9]{2}\s\+[0-9]{4}'` || parse_error "datetime" $1 $cnt
        line=${line//$datetime}

        # Input 3. slug
        slug=`echo $line | grep -o '[a-zA-Z0-9]*/mp4' | sed -e 's/\/mp4//'` || parse_error "stream slug" $1 $cnt

        # remove trash
        trash=`echo $line | grep -E -o '^.*HTTP/[0-2]{1}.[0-9]{1}"\s'` || parse_error "full HTTP GET req" $1 $cnt
        line=${line//$trash}

        # Input 4. http code
        http_code=`echo $line | grep -o '^\S*'` || parse_error "http code" $1 $cnt
        line=${line//$http_code}

        # this can be checked only here with regex above :(
        if [ "$http_code" != "200" ] && [ "$http_code" != "206" ]; then
            # continue to next line, skip this one, because only HTTP 200, 206 req are acceptable.
            continue
        fi

        # Input 5. http payload chunk size
        chunk_size=`echo $line | grep -o '^\S*'` || parse_error "chunk size" $1 $cnt
        line=${line//$chunk_size}

        # Input 6. Referrer
        referrer=`echo $line | grep -Po -m 1 '"\K[^"]*' | head -1` || parse_error "referrer" $1 $cnt

        # Handle cases when regex in Input 6 fails to match http referer, and match some trash instead
        if [[ $referrer != "http"* ]]; then
            referrer=""
        fi

        wait

        # printf "$ip,$datetime,$slug,$http_code,$chunk_size,%s\n" $referrer >> ./$3.csv || echo "[-] Can not write to .csv, something is bad at $cnt."
        string+="$ip,$datetime,$slug,$http_code,$chunk_size,$referrer\n" || echo "[-] Can not put to string, something is bad at $cnt."
    done < "$1"
Novitoll
  • What is your exact expected output for your input line? – Inian Mar 15 '17 at 07:06
  • I'd say you're trying to use the wrong tool for the job if you want lookbehind etc. A more 'complete' text processing application would bring huge benefits - Perl would do the job hands-down. – FreudianSlip Mar 15 '17 at 07:13
  • @Inian, I would like to extract these parts and write them to a CSV file. I guess the problem is that grep compiles the regexp for each line. Also, I should somehow use one regexp match with groupings; currently I'm running grep for each part. – Novitoll Mar 15 '17 at 07:39
  • @Novitoll: Can you give some more sample lines to understand a common format of the logs, and can you use `Awk`? – Inian Mar 15 '17 at 07:40
  • @Inian, thanks, I've posted the code. It looks ugly to me, too many greps. This should probably be done in C or Perl – Novitoll Mar 15 '17 at 07:48
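
The single-regex idea raised in the comments could look roughly like the sketch below. It lets one sed process scan the whole file with one pattern and capture groups, instead of starting several grep processes per line. It assumes every line follows the sample format shown in the question. `log_to_csv` is a hypothetical name, and the sketch emits the full request path rather than the slug in front of /mp4.

```shell
# Sketch only: convert an access log on stdin to CSV on stdout in one pass.
# log_to_csv is a hypothetical name, not part of the original script.
# Capture groups: 1=ip, 2=datetime, 3=request path, 4=status, 5=bytes, 6=referrer.
# The (200|206) alternative drops every other status code, and -n with the
# trailing p flag prints only the lines on which the substitution succeeded.
log_to_csv() {
    sed -nE 's/^([^ ]+) [^ ]+ [^ ]+ \[([^]]+)\] "[A-Z]+ ([^ ]+) [^"]*" (200|206) ([0-9]+) "([^"]*)".*$/\1,\2,\3,\4,\5,\6/p'
}
```

Something like `log_to_csv < access.log > out.csv` would then write the CSV in a single pass (both file names are placeholders). Unlike the loop in the question, this keeps referrers such as "-" as they are, and lines that do not match the pattern are skipped silently rather than reported through parse_error.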
