
I'm trying to convert a text log file to CSV format. Each entry starts with the prefixes ("t=%m p=%p h=%h db=%d u=%u x=%x"), and everything up to the next prefixed line should be treated as one row, so a row may contain embedded \n and \r characters.

t=2020-08-25 15:00:00.000 +03 p=16205 h=127.0.0.1 db=test u=test_app x=0 LOG:  duration: 0.011 ms  execute S_40: SELECT ID, EKLEME_ZAMANI, EKLEYEN_KULLANICI_ID, GORULME_DURUMU, GUNCELLEME_ZAMANI, GUNCELLEYEN_KULLANICI_ID, IP_ADRESI, ISLEM_ZAMANI, ISLEMI_YAPAN_KULLANICI_ID, METOD, PARAMETRE_JSON, UYGULAMA_ID, VERSIYON, DURUM_ID FROM DB_LOG WHERE (ID = $1)
t=2020-08-25 15:00:00.000 +03 p=16205 h=127.0.0.1 db=test u=test_app x=0 DETAIL:  parameters: $1 = '9187372'
t=2020-08-25 15:00:00.001 +03 p=36001 h=127.0.0.1 db=test u=test_app x=0 LOG:  duration: 0.005 ms  bind S_1: COMMIT
t=2020-08-25 15:00:00.001 +03 p=36001 h=127.0.0.1 db=test u=test_app x=0 LOG:  duration: 0.004 ms  execute S_1: COMMIT
t=2020-08-25 15:00:00.001 +03 p=16205 h=127.0.0.1 db=test u=test_app x=0 LOG:  duration: 0.018 ms  bind S_41: INSERT INTO DB_LOG (ID, EKLEME_ZAMANI, EKLEYEN_KULLANICI_ID, GORULME_DURUMU, GUNCELLEME_ZAMANI, GUNCELLEYEN_KULLANICI_ID, IP_ADRESI, ISLEM_ZAMANI, ISLEMI_YAPAN_KULLANICI_ID, METOD, PARAMETRE_JSON, UYGULAMA_ID, VERSIYON, DURUM_ID) VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13, $14)
t=2019-12-19 17:00:00.102 +03 p=58042 h= db= u= x=0 LOG:  automatic vacuum of table "postgres.pgagent.pga_job": index scans: 0
    pages: 0 removed, 9 remain, 0 skipped due to pins, 0 skipped frozen
    tuples: 0 removed, 493 remain, 472 are dead but not yet removable, oldest xmin: 20569983
    buffer usage: 90 hits, 0 misses, 0 dirtied
    avg read rate: 0.000 MB/s, avg write rate: 0.000 MB/s
    system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s

After the prefixes come the SQL statements, which naturally vary from entry to entry.

If possible, it would be perfect to drop the prefix keys, with each row formatted like below:

"2020-08-25 15:00:00.000 +03","16205","127.0.0.1","test","test_app","0","LOG:"," duration: 0.011 ms  execute S_40: SELECT ID, EKLEME_ZAMANI, EKLEYEN_KULLANICI_ID, GORULME_DURUMU, GUNCELLEME_ZAMANI, GUNCELLEYEN_KULLANICI_ID, IP_ADRESI, ISLEM_ZAMANI, ISLEMI_YAPAN_KULLANICI_ID, METOD, PARAMETRE_JSON, UYGULAMA_ID, VERSIYON, DURUM_ID FROM DB_LOG WHERE (ID = $1)"
"2020-08-25 15:00:00.000 +03","16205","127.0.0.1","test","test_app","0","DETAIL:"," parameters: $1 = '9187372'"
"2020-08-25 15:00:00.001 +03","36001","127.0.0.1","test","test_app","0","LOG:"," duration: 0.005 ms  bind S_1: COMMIT"
"2020-08-25 15:00:00.001 +03","36001","127.0.0.1","test","test_app","0","LOG:"," duration: 0.004 ms  execute S_1: COMMIT"
"2020-08-25 15:00:00.001 +03","16205","127.0.0.1","test","test_app","0","LOG:"," duration: 0.018 ms  bind S_41: INSERT INTO DB_LOG (ID, EKLEME_ZAMANI, EKLEYEN_KULLANICI_ID, GORULME_DURUMU, GUNCELLEME_ZAMANI, GUNCELLEYEN_KULLANICI_ID, IP_ADRESI, ISLEM_ZAMANI, ISLEMI_YAPAN_KULLANICI_ID, METOD, PARAMETRE_JSON, UYGULAMA_ID, VERSIYON, DURUM_ID) VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13, $14)"
"2019-12-19 17:00:00.102 +03","58042","","","","0","LOG:"," automatic vacuum of table "postgres.pgagent.pga_job": index scans: 0pages: 0 removed, 9 remain, 0 skipped due to pins, 0 skipped frozen    tuples: 0 removed, 493 remain, 472 are dead but not yet removable, oldest xmin: 20569983    buffer usage: 90 hits, 0 misses, 0 dirtied    avg read rate: 0.000 MB/s, avg write rate: 0.000 MB/s    system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s"

regex101: https://regex101.com/r/R3vADD/4

But I'm not sure about the last part of the expected output: the double quotes around "postgres.pgagent.pga_job" may cause issues when COPYing the CSV file into the database.

" automatic vacuum of table "postgres.pgagent.pga_job": index scans: 0pages: 0 removed, 9 remain, 0 skipped due to pins, 0 skipped frozen    tuples: 0 removed, 493 remain, 472 are dead but not yet removable, oldest xmin: 20569983    buffer usage: 90 hits, 0 misses, 0 dirtied    avg read rate: 0.000 MB/s, avg write rate: 0.000 MB/s    system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s"
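For what it's worth, standard CSV (RFC 4180) escapes an embedded double quote by doubling it, and PostgreSQL's COPY in CSV mode accepts that convention by default. A minimal sketch of what a conforming writer produces, using Python's csv module with field values taken from the sample above:

```python
import csv
import io

# Write one row containing an embedded double quote; QUOTE_ALL mirrors
# the "every field quoted" style of the expected output.
buf = io.StringIO()
writer = csv.writer(buf, quoting=csv.QUOTE_ALL, lineterminator="\n")
writer.writerow([
    "2019-12-19 17:00:00.102 +03",
    'automatic vacuum of table "postgres.pgagent.pga_job": index scans: 0',
])
print(buf.getvalue())
# The embedded quotes come out doubled: ""postgres.pgagent.pga_job""
```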

thanks all.

2 Answers


With GNU awk for FPAT, the 3rd arg to match(), and \s/\S shorthand for [[:space:]] and [^[:space:]]:

$ cat tst.awk
BEGIN {
    FPAT = "[[:alnum:]]+=[^=]* "    # a field is a key=value run ending in a blank
    OFS = ","
}
/^\S/ { if (NR>1) prt() }           # a non-indented line starts a new entry: print the previous one
{ prev = prev $0 }                  # accumulate the current entry, continuation lines included
END { prt() }                       # print the final entry

function prt(   orig, i, a) {
    orig = $0
    $0 = prev

    # split the entry into: the prefix part, the LOG|DETAIL tag, and the message
    match($0,/(.* )(LOG|DETAIL): +(.*)/,a)

    $0 = a[1]                       # assigning to $0 re-splits it into fields using FPAT
    $(NF+1) = a[2]                  # append the tag and the message as extra fields
    $(NF+1) = a[3]

    for (i=1; i<=NF; i++) {
        gsub(/^\s+|\s+$/,"",$i)     # trim surrounding whitespace
        sub(/^\S+=/,"",$i)          # strip the "key=" prefix
        gsub(/"/,"\"\"",$i)         # escape embedded double quotes CSV-style
        printf "\"%s\"%s", $i, (i<NF ? OFS : ORS)
    }

    $0 = orig
    prev = ""
}


$ awk -f tst.awk file
"2020-08-25 15:00:00.000 +03","16205","127.0.0.1","test","test_app","0","LOG","duration: 0.011 ms  execute S_40: SELECT ID, EKLEME_ZAMANI, EKLEYEN_KULLANICI_ID, GORULME_DURUMU, GUNCELLEME_ZAMANI, GUNCELLEYEN_KULLANICI_ID, IP_ADRESI, ISLEM_ZAMANI, ISLEMI_YAPAN_KULLANICI_ID, METOD, PARAMETRE_JSON, UYGULAMA_ID, VERSIYON, DURUM_ID FROM DB_LOG WHERE (ID = $1)"
"2020-08-25 15:00:00.000 +03","16205","127.0.0.1","test","test_app","0","DETAIL","parameters: $1 = '9187372'"
"2020-08-25 15:00:00.001 +03","36001","127.0.0.1","test","test_app","0","LOG","duration: 0.005 ms  bind S_1: COMMIT"
"2020-08-25 15:00:00.001 +03","36001","127.0.0.1","test","test_app","0","LOG","duration: 0.004 ms  execute S_1: COMMIT"
"2020-08-25 15:00:00.001 +03","16205","127.0.0.1","test","test_app","0","LOG","duration: 0.018 ms  bind S_41: INSERT INTO DB_LOG (ID, EKLEME_ZAMANI, EKLEYEN_KULLANICI_ID, GORULME_DURUMU, GUNCELLEME_ZAMANI, GUNCELLEYEN_KULLANICI_ID, IP_ADRESI, ISLEM_ZAMANI, ISLEMI_YAPAN_KULLANICI_ID, METOD, PARAMETRE_JSON, UYGULAMA_ID, VERSIYON, DURUM_ID) VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13, $14)"
"2019-12-19 17:00:00.102 +03","58042","","","","0","LOG","automatic vacuum of table ""postgres.pgagent.pga_job"": index scans: 0    pages: 0 removed, 9 remain, 0 skipped due to pins, 0 skipped frozen    tuples: 0 removed, 493 remain, 472 are dead but not yet removable, oldest xmin: 20569983    buffer usage: 90 hits, 0 misses, 0 dirtied    avg read rate: 0.000 MB/s, avg write rate: 0.000 MB/s    system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s"

The last line of expected output in your question contains " automatic vacuum of table "postgres.pgagent.pga_job": index ..." but that isn't valid CSV, since you can't have unescaped double quotes inside a double-quoted string. To be valid CSV it'd have to be " automatic vacuum of table ""postgres.pgagent.pga_job"": index ..." or " automatic vacuum of table \"postgres.pgagent.pga_job\": index ...", depending on which escaping convention (see What's the most robust way to efficiently parse CSV using awk?) is expected by whichever tool you're going to read it with. I used "" in my script above since that's what MS-Excel expects, but it'd be a trivial tweak to use \" instead if that's what you need - just change gsub(/"/,"\"\"",$i) to gsub(/"/,"\\\"",$i).
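Either convention round-trips cleanly; here's a quick check sketched with Python's csv module (the \" variant needs doublequote=False plus an escapechar on the reading side):

```python
import csv
import io

# Excel-style escaping: an embedded quote is doubled inside the field.
doubled = '"automatic vacuum of table ""postgres.pgagent.pga_job"": index scans: 0"'
row = next(csv.reader(io.StringIO(doubled)))
print(row[0])

# Backslash-style escaping: the same field written with \" instead;
# the reader then needs doublequote=False and an escapechar.
escaped = '"automatic vacuum of table \\"postgres.pgagent.pga_job\\": index scans: 0"'
row2 = next(csv.reader(io.StringIO(escaped), doublequote=False, escapechar="\\"))
print(row2[0])
```

Both readers recover the original field value, embedded quotes intact.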

Ed Morton
  • nice use of gawk functionality! One Q tho: why do you need ```FPAT``` if you use ```match``` on ```$0``` and then overwrite ```$0```? I see the results are different if I comment out the FPAT, but I just don't understand why. Care to explain? – vgersh99 Aug 28 '20 at 14:45
  • @vgersh99 I'm using match() to separate $0 into 3 parts - the part before LOG|DETAIL where the FPAT will be used, then LOG or DETAIL, then the part after it. Does that make sense or is it still fuzzy? – Ed Morton Aug 28 '20 at 14:47
  • THAT I understand perfectly well. The Q is: why you need to define FPAT if you're only match-ing on $0 (which doesn't do any field splitting)? and then you're overwritting $0... – vgersh99 Aug 28 '20 at 14:49
  • When I do `$0 = a[1]` after the call to match(), $0 has the value `t=2020-08-25 15:00:00.000 +03 p=16205 h=127.0.0.1 db=test u=test_app x=0` and field splitting using FPAT occurs when I do that assignment, setting `$1` to `t=2020-08-25 15:00:00.000 +03`, `$2` to `p=16205`, etc. – Ed Morton Aug 28 '20 at 14:51
  • Thanks a lot @EdMorton, can you explain it row by row, or refer me to the documentation related to each line? – Багир Гварамадзе Aug 31 '20 at 11:33
  • No, sorry, this forum is to help people with their coding, not do it all for them, so while I don't mind sometimes posting a solution for people who haven't provided their own attempt in their question, I'm not going to use it as an introductory tutorial to the language by explaining every line. The code is very simple once you understand the basics of awk, so please look up the constructs at https://www.gnu.org/software/gawk/manual/gawk.html, google a bit, add some "print" statements to see what's happening where, and feel free to ask if you have any specific questions after that. – Ed Morton Aug 31 '20 at 17:01
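The field-splitting discussed in this exchange can also be seen outside awk. As a rough sketch, here's the FPAT pattern translated to Python regex syntax ([[:alnum:]] written out as [A-Za-z0-9]), applied to the prefix part that the script assigns back to $0:

```python
import re

# FPAT = "[[:alnum:]]+=[^=]* " translated to Python regex syntax.
fpat = r"[A-Za-z0-9]+=[^=]* "
prefix = "t=2020-08-25 15:00:00.000 +03 p=16205 h=127.0.0.1 db=test u=test_app x=0 "

# One match per key=value field, trailing blank included - the same
# fields gawk produces when the prefix is assigned to $0.
fields = re.findall(fpat, prefix)
for f in fields:
    print(repr(f))
```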

Here you go: https://regex101.com/r/R3vADD/1

^t=(.* .*) p=(\d+)? h=(.*)? db=(\w+)? u=(\w+)? x=(\d+)? (\w+:) (.*)

will match the groups, and you can replace them like so:

"\1","\2","\3","\4","\5","\6","\7","\8"

Example in CLI with Perl:

perl -pe 's/^t=(.* .*) p=(\d+) h=(.*) db=(\w+) u=(\w+) x=(\d+) (\w+:) (.*)/"\1","\2","\3","\4","\5","\6","\7","\8"/g' file.csv
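The same pattern and replacement can be sanity-checked with Python's re module; a sketch using one sample line from the question:

```python
import re

pattern = r"^t=(.* .*) p=(\d+)? h=(.*)? db=(\w+)? u=(\w+)? x=(\d+)? (\w+:) (.*)"
line = ("t=2020-08-25 15:00:00.001 +03 p=36001 h=127.0.0.1 "
        "db=test u=test_app x=0 LOG:  duration: 0.005 ms  bind S_1: COMMIT")

# Rewrite the captured groups into a fully quoted CSV row.
out = re.sub(pattern, r'"\1","\2","\3","\4","\5","\6","\7","\8"', line)
print(out)
```

Note that this, like the Perl one-liner, handles only single-line entries; multi-line entries (such as the autovacuum report) would need to be joined first.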
Ron