After understanding the log concept, I'm working on web request logs and trying to match some keywords inside the log strings whose method is GET. I also reviewed pythonic log parsing and related posts 1 and 2.
Let's assume that I have a portion of a log line that looks something like this:
"GET /pas/public/ping?pas-client=007 HTTP/1.1"
The full log is:
[10/Oct/2021:05:45:20 +0000] SsAzyfdtrfdV1GKU7+Q== user_Zwfikdlo5CgOT0Loq8g== "GET /pas/public/ping?pas-client=007 HTTP/1.1" 200 "-b" 53b 5ms "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)" qFhyYRqbqtsM7tA== 50001 - -
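To make the pieces of that line explicit, here is a minimal sketch in plain Python (not Spark) that pulls the request field out of the full log line and splits it into method, path, and query string. It assumes the first quoted field always has the shape `"<METHOD> <target> HTTP/<version>"`; the variable names are only illustrative.

```python
import re
from urllib.parse import urlsplit

line = ('[10/Oct/2021:05:45:20 +0000] SsAzyfdtrfdV1GKU7+Q== user_Zwfikdlo5CgOT0Loq8g== '
        '"GET /pas/public/ping?pas-client=007 HTTP/1.1" 200 "-b" 53b 5ms '
        '"Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)" qFhyYRqbqtsM7tA== 50001 - -')

# Assumption: the request is the first quoted field shaped like
# "<METHOD> <target> HTTP/<version>".
m = re.search(r'"([A-Z]+) (\S+) HTTP/[\d.]+"', line)
method, target = m.group(1), m.group(2)

# urlsplit separates the path (before "?") from the URL parameters (after "?").
parts = urlsplit(target)
path, query = parts.path, parts.query

print(method, path, query)
```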
So I tried to use a regex that can match part of the API or query path, meaning the part before the ? mark (that is, before the URL parameters). I want to check the events in a Spark dataframe that has a single column raw, and if the HTTP method is GET, match the event against /ping?. If it matches, I create a column named Type and label that event ping, as below; if it does not, I label it POST:
raw | Type |
---|---|
[10/Oct/2021:05:45:20 +0000] SsAzyfdtrfdV1GKU7+Q== user_Zwfikdlo5CgOT0Loq8g== "GET /pas/public/ping?pas-client=007 HTTP/1.1" 200 "-b" 53b 5ms "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)" qFhyYRqbqtsM7tA== 50001 - - | GET_ping |
# parsing df
from pyspark.sql import functions as F

sdf = (df
    .withColumn('Type', F
        .when(F.isnull('raw'), '-')
        # my attempt to specify ping activities (this regex does not match the line above)
        .when(F.regexp_extract('raw', '(?i)^/ping?$', 0) == F.col('raw'), 'ping')
    )
)
So my problem is similar to this post, a kind of multi-condition regex problem. How can I create the right conditional regex that matches:
- a word that starts with `"`, has the 3 letters `G` `E` `T`, and is followed by a space
- a word that starts with `/`, has the 4 letters `p` `i` `n` `g`, and ends with `?`
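The two conditions above can be sketched as a single pattern. This is only a sketch in plain Python `re`, not a tested Spark solution. It may also help explain why `(?i)^/ping?$` does not work: `^` and `$` anchor the pattern to the whole string, and the unescaped `?` makes the `g` optional instead of matching a literal question mark. The names `PING_GET` and `label` are hypothetical, and the fallback label `POST` is simply the one described earlier.

```python
import re

# Assumed pattern: a double quote, the method GET, whitespace, then a path
# whose last segment is "ping", immediately followed by a literal "?".
PING_GET = re.compile(r'"GET\s+\S*/ping\?')

def label(raw):
    """Return a Type label for one raw log line (hypothetical helper)."""
    if raw is None:
        return "-"
    if PING_GET.search(raw):  # search, not match: the request sits mid-line
        return "ping"
    return "POST"  # fallback label taken from the description above

line = ('[10/Oct/2021:05:45:20 +0000] SsAzyfdtrfdV1GKU7+Q== user_Zwfikdlo5CgOT0Loq8g== '
        '"GET /pas/public/ping?pas-client=007 HTTP/1.1" 200 "-b" 53b 5ms '
        '"Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)" qFhyYRqbqtsM7tA== 50001 - -')

print(label(line))
```

The pattern uses only syntax that Python's `re` and Java regular expressions share, so it could presumably be passed to `F.col('raw').rlike(...)` inside `F.when(...)`, but that part is an assumption and has not been verified on a cluster.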