How can I split each line of a log file into SQL columns? (maybe using a regex for splitting)

Question

I have a log file which I have to include it into QlikSense. QlikSense reads line to line of the log file so I need an expresion for splitting this line into the desired columns.

Log file looks like (its size is about 2.5 millions of entries):

202.32.92.47 - - [01/Jun/1995:00:00:59 -0600] "GET /~scottp/publish.html" 200 271 - -
ix-or7-27.ix.netcom.com RFC-1413 - [01/Jun/1995:00:02:51 -0600] "GET /~ladd/ostriches.html" 200 205908 - "Mozilla/5.0 (X11; U; Linux i686; es-ES;rv:1.7.5)" 
ppp-4.pbmo.net - John Thomas [07/Dec/1995:13:20:28 -0600] "GET /dcs/courses/cai/html/introduction_lesson/index.html HTTP/1.0" 500 - "http://www.wikipedia.org/" "Mozilla/5.0 (X11; U; Linux i686; es-ES;rv:1.7.5)" 
ppp-4.pbmo.net - John Thomas [07/Dec/1995:13:20:37 -0600] "GET /dcs/courses/cai/html/index.html HTTP/1.0" 500 4528 - - 
lbm2.niddk.nih.gov RFC-1413 John Thomas [07/Dec/1995:13:21:03 -0600] "GET /~ladd/vet_libraries.html" 200 11337 "http://www.wikipedia.org/" -

The struct of each line of this log file is: IP ID NAME DATETIME TIMEZONE METHOD DIR STATUS MB WEB FROM. So, I will split previous log example using || for a better visualization:

|| ix-or7-27.ix.netcom.com || RFC-1413 || - || [01/Jun/1995:00:02:51 || -0600] "GET || /~ladd/ostriches.html" || 200 || 205908 || - || "Mozilla/5.0 (X11; U; Linux i686; es-ES;rv:1.7.5)" ||
|| ppp-4.pbmo.net || - || John Thomas || [07/Dec/1995:13:20:28 || -0600] || "GET || /dcs/courses/cai/html/introduction_lesson/index.html HTTP/1.0" || 500 || - || "http://www.wikipedia.org/" || "Mozilla/5.0 (X11; U; Linux i686; es-ES;rv:1.7.5)" ||
|| ppp-4.pbmo.net || - || John Thomas || [07/Dec/1995:13:20:37 || -0600] || "GET || /dcs/courses/cai/html/index.html HTTP/1.0" || 500 || 4528 || - || - ||
|| lbm2.niddk.nih.gov || RFC-1413 || John Thomas || [07/Dec/1995:13:21:03 || -0600] || "GET || /~ladd/vet_libraries.html" || 200 || 11337 || "http://www.wikipedia.org/" || - ||

So, for example, for the first line:

IP = ix-or7-27.ix.netcom.com 
ID = RFC-1413 
NAME = - 
DATETIME = 01/Jun/1995 00:02:51 
TIMEZONE = -0600 
METHOD = GET 
DIR: /~ladd/ostriches.html
STATUS = 200 
MB = 205908 
WEB = -
FROM = Mozilla/5.0 (X11; U; Linux i686; es-ES;rv:1.7.5)

So, each field's value can be text or -. I have trying many ways of including it but I do not achieve that.

I have tryed splitting each line using space separator, but this not work due to each line can have different number of spaces. Also using -,... But I did not get it work due to data lenght is variable.

I have tought that maybe doing a RegEx (a pattern) maybe could solve my problem, but I have not get experience in patterns and I do not know how could I do that.

EDIT 1:

If the solution to my problem is a regex pattern should do next:

First parameter: catch all up to space
Second parameter: catch all up to space
Third parameter: catch all up to [
Fourth parameter: catch all up to space
Fifth parameter: catch all up to ]
Sixth parameter: catch all up to space
Seventh parameter: catch all up to space
Eigth parameter: catch all up to space
Nineth parameter: catch all up to space
Tenth parameter: catch all inside "" or -
Eleventh parameter: catch all inside "" or -

Any idea how could I got it?

Thank you.

There are some packages that do this kind of task. Look for them. — Rick James, Dec 13 '18 at 23:25
can you say me an example of these packages? I do not know any — JuMoGar, Dec 14 '18 at 09:38
This might be a use case best suited to one of the data integration tools out there like Talend or others. In those tools you could build an integration job that would connect to your data source (i.e. log file), parse each line using regex, then deliver the parsed output to the output format of your choice (i.e. csv, sql, etc.) where it could then be picked up by QlikSense. Alternatively you might have a look at the QlikSense Regex connector https://help.qlik.com/en-US/connectors/Subsystems/Web_Connectors_help/Content/Connectors_QWC/Data-Source-Connectors/RegEx-Connector.htm — BRM, Jan 09 '19 at 22:38

score 0 · Answer 1 · answered Dec 12 '18 at 22:10

I once had to parse multiple 36GB log files with varying lengths (after splitting on space). Tried RegExp and it worked, but here is very different. You might just have to do line.split(" ").length then check the count, and then do your logic based on that.

    PrintWriter out=new PrintWriter("/directory/log.txt"),errorsOut=new PrintWriter("/directory/log-errors.txt");
    for(String line:lines){
      try{
        if(line.split(" ").length==11){
            String result=line[0]+"|"+line[1]+"|"+line[2]+"|"+line[3]replace("[", "").replaceFirst(":", " ")+"|"+...(etc)...
            out.println(line);
      }catch(Exception e) {
          errorsOut.println(line);
      }
        } else if(line.split(" ").length==14) { ... }
    }

May not be the most efficient but for 2.5MB it's not gonna die, it will catch a lot, and if there are errors you can write them out into a separate file to check later.

I've also tried logstash and other Enterprise log viewers. Some are good but most do not offer a "catch-all" solution.

Thank you @esca2791. First: I have been seeing your code. It split each line on spaces, and then will not accept all lines due to for examen will recognoice John Thomas as two parameters instead of one. Second: I can not modify log file, nor create other, I must use original log file. — JuMoGar, Dec 13 '18 at 11:58
Also, I have tested your code (removing `...` ,`...(etc)...` and fixing other erros) I have not achieve make it work — JuMoGar, Dec 13 '18 at 12:39

score 0 · Answer 2 · answered Dec 12 '18 at 22:22

0

Inspired by this SO answer, you can try the following regex and then sanitize it by removing trailing " and [] chars.

(.*?)\s(?=(?:[^"]*"[^"]*")*[^"]*\Z)

If you want regex only solution, you can try plugging into this, in order to remove those trailing chars. I'd suggest using https://regex101.com/

P.S. Group 1 of this regex contains all the data pieces you'd like.

answered Dec 12 '18 at 22:22

LIvanov

1,126
12
30

Thank you, @Llvanov. But seeing that I have two question: First: It is posible that John Thomas is taken as two separated values instead of one? Second: How can I export that to each column? (I mean, that RegEx match with all fields, but I do not know how can I assign each one to each column). – JuMoGar Dec 13 '18 at 08:23

score 0 · Answer 3 · answered Jan 15 '19 at 01:47

0

Just use SubField function, https://help.qlik.com/en-US/sense/November2018/Subsystems/Hub/Content/Sense_Hub/Scripting/StringFunctions/SubField.htm

example:

LOAD

text

, subfield (text , '||' , 1) as 1_parameter

, subfield (text , '||' , 2) as 2_parameter

etc...

answered Jan 15 '19 at 01:47

Chernov

532
3
4

Thank you, but `||` only was for a better visualization. Original file has not contain ´||´ characters, so I can not use `subfield (text , '||' , 1)`. Original file has the format shown after `EDIT 1` – JuMoGar Jan 15 '19 at 08:23

How can I split each line of a log file into SQL columns? (maybe using a regex for splitting)

3 Answers3