After getting familiar with the logging concept, I'm working on web request logs and trying to match some keywords inside those log strings when their method is GET. I have also reviewed pythonic log parsing and related posts 1 and 2.

Let's assume that I have a portion of a log line that looks something like this:

"GET /pas/public/ping?pas-client=007 HTTP/1.1"

The full log line is:

[10/Oct/2021:05:45:20 +0000] SsAzyfdtrfdV1GKU7+Q== user_Zwfikdlo5CgOT0Loq8g== "GET /pas/public/ping?pas-client=007 HTTP/1.1" 200 "-b" 53b 5ms "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)" qFhyYRqbqtsM7tA== 50001 - -

So I tried to write a regex that matches part of the API or query path, meaning the part before the ? mark (that is, before the URL parameters). I want to check the events in a Spark dataframe that has a single column raw: if the HTTP method is GET, I match the path against /ping?. If it matches, I create a column named Type and label that event ping, as shown below; if it does not, I label it POST:

raw | Type
[10/Oct/2021:05:45:20 +0000] SsAzyfdtrfdV1GKU7+Q== user_Zwfikdlo5CgOT0Loq8g== "GET /pas/public/ping?pas-client=007 HTTP/1.1" 200 "-b" 53b 5ms "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)" qFhyYRqbqtsM7tA== 50001 - - | GET_ping
# parsing df
from pyspark.sql import functions as F

sdf = (df
    .withColumn('Type', F
        .when(F.isnull('raw'), '-')
        .when(F.regexp_extract('raw', '(?i)^/ping?$', 0) == F.col('raw'), 'ping') # to flag ping activities
    )
)

So my problem is similar to this post: a kind of multi-condition regex problem. How can I create the right conditional regex, one that matches:

  • a word that starts with ", consists of the 3 letters G E T, and is followed by a space
  • a word that starts with /, ends with ?, and consists of the 4 letters p i n g
Mario

1 Answer

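Your regex seems a bit off; my version is the Spark snippet below. The ^ and $ anchors require the whole line to be nothing but /ping, and the unescaped ? makes the g optional instead of matching a literal question mark. You should at least escape the question mark ? after ping with a backslash \ (escaping the forward slash / is harmless but not required).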
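Before the Spark snippet, a quick way to see the effect of the escaping is to try both patterns on the sample line from the question. This is only a sketch in plain Python `re`, not Spark; the variable names are made up for illustration, and the simplified `ESCAPED` pattern is an assumption that should behave the same way in Spark's Java regex.

```python
import re

# Sample line copied from the question.
LINE = ('[10/Oct/2021:05:45:20 +0000] SsAzyfdtrfdV1GKU7+Q== user_Zwfikdlo5CgOT0Loq8g== '
        '"GET /pas/public/ping?pas-client=007 HTTP/1.1" 200 "-b" 53b 5ms '
        '"Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)" qFhyYRqbqtsM7tA== 50001 - -')

ORIGINAL = r'(?i)^/ping?$'  # pattern from the question: anchored, "?" unescaped
ESCAPED = r'GET.*/ping\?'   # no anchors, literal "?" (illustrative simplification)

# The anchored pattern cannot match a full log line.
original_hit = re.search(ORIGINAL, LINE) is not None

# Without anchors and with "\?", the GET ... /ping? part is found.
escaped_hit = re.search(ESCAPED, LINE) is not None

# The unescaped "?" makes the "g" optional, so even "/pin" is accepted.
optional_g = re.search(ORIGINAL, '/pin') is not None

print(original_hit, escaped_hit, optional_g)
```

The first pattern fails on the log line because of the anchors, not only because of the question mark, so both changes are needed.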

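from pyspark.sql import functions as F

(df
    .withColumn('type', F
        .when(F.isnull('raw'), '-')
        # rlike returns a boolean; regexp_extract returns '' (never null) when nothing matches
        .when(F.col('raw').rlike(r'(GET).*(\/ping)\?'), 'GET_ping')
        .otherwise('POST')
    )
    .select('type')
    .show()
)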
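One thing to watch when a condition is built on regexp_extract: Spark documents that it returns an empty string, not null, when the pattern does not match, so the check has to compare against '' (or use a boolean match) rather than test for null. Below is a plain-Python stand-in for the labelling logic under that assumption. `extract_or_empty` and `label` are hypothetical helper names, and `PING` is a slightly tighter pattern than the one above (it ties GET to the opening quote and keeps /ping? inside the same path), offered as a suggestion rather than a requirement.

```python
import re

# Assumed pattern: quoted request that starts with GET, path ending in /ping?
PING = r'"GET \S*/ping\?'

def extract_or_empty(raw, pattern):
    """Mimic regexp_extract(raw, pattern, 0): matched text, or '' if no match."""
    m = re.search(pattern, raw)
    return m.group(0) if m else ''

def label(raw):
    """Same branch order as the when/otherwise chain."""
    if raw is None:
        return '-'
    if extract_or_empty(raw, PING) != '':
        return 'GET_ping'
    return 'POST'

get_line = '"GET /pas/public/ping?pas-client=007 HTTP/1.1" 200'
post_line = '"POST /pas/public/ping?pas-client=007 HTTP/1.1" 200'

labels = [label(get_line), label(post_line), label(None)]
print(labels)
```

Note that the final branch labels every non-matching line POST, including GET requests to other paths; that follows the question's wording, so add another branch if those need their own label.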
pltc