19

I have an Apache access.log file, which is around 35 GB in size. Grepping through it is no longer an option without a great deal of waiting.

I want to split it into many small files, using the date as the splitting criterion.

The date is in the format [15/Oct/2011:12:02:02 +0000]. Any idea how I could do it using only bash scripting, standard text-manipulation programs (grep, awk, sed, and the like), piping and redirection?

The input file name is access.log. I'd like the output files to have a format such as access.apache.15_Oct_2011.log (that would do the trick, although it's not nice for sorting).
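For reference, a line in a combined-format access log looks roughly like this (the values below are made up):

203.0.113.5 - - [15/Oct/2011:12:02:02 +0000] "GET /index.html HTTP/1.1" 200 2326 "-" "Mozilla/5.0"

The bracketed date is the fourth whitespace-separated field, which is why several of the answers below key on $4 in awk.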

mr.b (edited by StackzOfZtuff)

7 Answers

24

One way using awk:

awk 'BEGIN {
    # map month abbreviations to zero-padded numbers: Jan -> 01, ..., Dec -> 12
    split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", months, " ")
    for (a = 1; a <= 12; a++)
        m[months[a]] = sprintf("%02d", a)
}
{
    # $4 is the bracketed date field, e.g. [15/Oct/2011:12:02:02
    split($4, array, "[:/]")
    year = array[3]
    month = m[array[2]]

    print > FILENAME "-" year "_" month ".txt"
}' incendiary.ws-2010

This will output files like:

incendiary.ws-2010-2010_04.txt
incendiary.ws-2010-2010_05.txt
incendiary.ws-2010-2010_06.txt
incendiary.ws-2010-2010_07.txt

Against a 150 MB log file, the answer by chepner took 70 seconds on a 3.4 GHz 8-core Xeon E3-1270, while this method took 5 seconds.

Original inspiration: "How to split existing apache logfile by month?"

Theodore R. Smith (edited by Marián Černý)
  • You are right, Sir. I've just tested the Perl solution as well, and the awk solution was faster by 3x. I suspect it has to do with the fact that the awk example doesn't use regular expressions but simple string splitting, which might be more efficient. Marking as the accepted answer. – mr.b Jul 30 '12 at 10:56
  • Oh, and I'm definitely using this in production against 20 GB files with no problems now. Takes about 2 GB/minute on my system. – Theodore R. Smith Jul 30 '12 at 12:22
  • Similar performance here as well: ~1 minute / ~2.5 GB. Thanks! – mr.b Jul 30 '12 at 14:20
  • The only thing is, I need the date extracted as well; my daily log sizes are well over 400 MB these days. Could you modify the script to include the day as well? (See the sketch after these comments.) – mr.b Jul 30 '12 at 14:22
  • Shouldn't the 'split($4,array,"[:/]")' instruction come before 'year = array[3]'? – Ceki Apr 15 '13 at 06:35
  • @TheodoreR.Smith Your output file names are wrong because you encoded the `month` variable with two digits (`sprintf("%02d", a)`). Can you please fix your output file names to avoid confusion? – SebMa Aug 19 '22 at 10:32
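
Regarding mr.b's request for per-day files: a minimal sketch extending the same approach (untested; the access.apache.yyyy_mm_dd.log naming is an assumption, following the year-first ordering mr.b prefers in his own answer below):

awk 'BEGIN {
    split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", months, " ")
    for (a = 1; a <= 12; a++)
        m[months[a]] = sprintf("%02d", a)
}
{
    split($4, array, "[:/]")
    day = substr(array[1], 2)    # array[1] is "[dd"; strip the leading bracket
    fname = "access.apache." array[3] "_" m[array[2]] "_" day ".log"
    print > fname
}' access.log

If the log spans many distinct days, close each file when done with it (as in Thor's answer below) to avoid running out of file descriptors.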
10

Pure bash, making one pass through the access log:

while IFS= read -r; do
    # pull day, month and year out of the bracketed date
    [[ $REPLY =~ \[(..)/(...)/(....): ]]

    d=${BASH_REMATCH[1]}
    m=${BASH_REMATCH[2]}
    y=${BASH_REMATCH[3]}

    #printf -v fname "access.apache.%s_%s_%s.log" "${BASH_REMATCH[@]:1:3}"
    printf -v fname "access.apache.%s_%s_%s.log" "$y" "$m" "$d"

    echo "$REPLY" >> "$fname"
done < access.log
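
Since the question notes that month abbreviations don't sort nicely: a variation of the same loop that maps month names to numbers with an associative array, so the file names sort lexically (a sketch, not part of the original answer; requires bash 4+ for declare -A):

declare -A mon=([Jan]=01 [Feb]=02 [Mar]=03 [Apr]=04 [May]=05 [Jun]=06
                [Jul]=07 [Aug]=08 [Sep]=09 [Oct]=10 [Nov]=11 [Dec]=12)

while IFS= read -r; do
    [[ $REPLY =~ \[(..)/(...)/(....): ]]
    # name files year first with a numeric month, e.g. access.apache.2011_10_15.log
    printf -v fname "access.apache.%s_%s_%s.log" \
        "${BASH_REMATCH[3]}" "${mon[${BASH_REMATCH[2]}]}" "${BASH_REMATCH[1]}"
    echo "$REPLY" >> "$fname"
done < access.log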
chepner
  • The method in my answer is dramatically faster: against a 150 MB log file, this answer took **70 seconds** on a 3.4 GHz 8-core Xeon E3-1270, while the method in mine took **5 seconds**. – Theodore R. Smith Jul 30 '12 at 01:18
  • However, this answer creates log files on a daily basis, not a monthly basis. This does less, so no wonder it is faster. – i.am.michiel Dec 04 '15 at 08:39
  • @i.am.michiel The reason this is slower is that iterating through the input is much faster in `awk` than in `bash`; the number of output files is not really relevant. – chepner Dec 04 '15 at 12:32
5

Here is an awk version that outputs lexically sortable log files.

Some efficiency enhancements: everything is done in one pass; fname is only regenerated when the date changes; and each output file is closed when switching to a new one (otherwise you might run out of file descriptors).

awk -F"[]/:[]" '
BEGIN {
  m2n["Jan"] =  1;  m2n["Feb"] =  2; m2n["Mar"] =  3; m2n["Apr"] =  4;
  m2n["May"] =  5;  m2n["Jun"] =  6; m2n["Jul"] =  7; m2n["Aug"] =  8;
  m2n["Sep"] =  9;  m2n["Oct"] = 10; m2n["Nov"] = 11; m2n["Dec"] = 12;
}
{
  if($4 != pyear || $3 != pmonth || $2 != pday) {
    pyear  = $4
    pmonth = $3
    pday   = $2

    if(fname != "")
      close(fname)

    fname  = sprintf("access_%04d_%02d_%02d.log", $4, m2n[$3], $2)
  }
  print > fname
}' access-log
Thor
4

Perl came to the rescue:

perl -ne 'm@\[(\d{1,2})/(\w{3})/(\d{4}):@; open(LOG, ">>access.apache.$3_$2_$1.log"); print LOG $_;' access.log

Well, Perl is not exactly a "standard" text-manipulation program, but it was made for text manipulation nevertheless.

I've also changed the order of the arguments in the file name, so that files are named like access.apache.yyyy_mon_dd.log for easier sorting.

mr.b
4

I combined Theodore's and Thor's solutions to get Thor's efficiency improvement plus daily files, while retaining the original's support for IPv6 addresses in combined-format logs.

awk '
BEGIN {
  m2n["Jan"] =  1;  m2n["Feb"] =  2; m2n["Mar"] =  3; m2n["Apr"] =  4;
  m2n["May"] =  5;  m2n["Jun"] =  6; m2n["Jul"] =  7; m2n["Aug"] =  8;
  m2n["Sep"] =  9;  m2n["Oct"] = 10; m2n["Nov"] = 11; m2n["Dec"] = 12;
}
{
  split($4, a, "[]/:[]")
  if(a[4] != pyear || a[3] != pmonth || a[2] != pday) {
    pyear  = a[4]
    pmonth = a[3]
    pday   = a[2]

    if(fname != "")
      close(fname)

    fname  = sprintf("access_%04d-%02d-%02d.log", a[4], m2n[a[3]], a[2])
  }
  print >> fname
}' access.log
jwadsack
1

Kind of ugly, that's bash for you:

    for year in 2010 2011 2012; do
        for month in jan feb mar apr may jun jul aug sep oct nov dec; do
            for day in 1 2 3 4 5 6 7 8 9 10 ... 31; do
                grep -i "$day/$month/$year" access.log > "$day-$month-$year.log"
            done
        done
    done
ncultra
  • very clever, thanks ;) this would work great for a small file (file size less than the amount of RAM), as it loops through the entire file about 1,116 times :) – mr.b Jul 27 '12 at 12:55
  • Very true, it's not an efficient script; it would be good for occasional use. Thanks! – ncultra Jul 27 '12 at 13:16
  • It would be faster to unroll the outer loop and process the file in two passes: on the first pass, split the file into entries by year; on the second, process each year file and split the entries by date. It may even be faster to unroll the second loop and process the file in three passes. (A sketch of the two-pass idea follows these comments.) – ncultra Jul 27 '12 at 13:31
  • Grepping for the date will accidentally drop stack traces etc., i.e. any lines that don't contain a date will be lost. Usually it is these lines that are the most interesting. – nby Mar 26 '19 at 11:49
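
A rough sketch of the two-pass idea from ncultra's comment (untested; assumes bash 4 for the zero-padded {01..31} expansion, and that days in the log are zero-padded, as Apache writes them):

# pass 1: split by year
for year in 2010 2011 2012; do
    grep -F "/$year:" access.log > "access-$year.log"
done

# pass 2: split each (much smaller) year file by day
# (dates with no traffic produce empty files)
for year in 2010 2011 2012; do
    for month in Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec; do
        for day in {01..31}; do
            grep -F "[$day/$month/$year:" "access-$year.log" > "$day-$month-$year.log"
        done
    done
done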
1

I made a slight improvement to Theodore's answer so I could see progress when processing a very large log file.

#!/usr/bin/awk -f

BEGIN {
    split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec ", months, " ")
    for (a = 1; a <= 12; a++)
        m[months[a]] = a
}
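# for each record: write it to FILENAME-YYYY-MM.txt, and echo each new
# year-month to stdout as a progress indicator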
{
    split($4, array, "[:/]")
    year = array[3]
    month = sprintf("%02d", m[array[2]])

    current = year "-" month
    if (last != current)
        print current
    last = current

    print >> FILENAME "-" year "-" month ".txt"
}

Also, I found that I needed to use gawk (brew install gawk if you don't have it) for this to work on Mac OS X.
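
Assuming the script above is saved as split-log.awk (a file name chosen here just for illustration), it would be run as:

gawk -f split-log.awk access.log

The output files are then named after the input via FILENAME, e.g. access.log-2011-10.txt.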

simon