0

I'm a newbie in PySpark and I want to exclude/remove the URL parameters from the raw weblog that existed in the spark dataframe. The nature of data is the following:

+----------------------------------------------------------------------------------------+
|weblog                                                                                  |
+----------------------------------------------------------------------------------------+
|[03/Oct/2021:09:26:37 +0000]                                                            |
|SsAzIiWuV1Bw9CtthtxTtav8VdmP3N2jkJ/ZTsx6u8ATOC8HFwxKYmWwMrwl6t7heGKU7+Q==               |     
|user_ZwfikI/2BdNcrhkwWai/bh+zX66co70YwGKAigzuLTW4khCvc1LLmFN1aBH7K0Loq8g==              |
|"HEAD /xxxx/pub/ping?xxxx-client=005 HTTP/1.1" 200 "-b" 53b 7ms                         |
|"Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)"                                    |
|WepX20WkyvTydOpOuk/IDIVsxN+4zOZbRzng== 50000 - -                                        |
+----------------------------------------------------------------------------------------+
|[03/Oct/2021:00:19:24 +0000]                                                            |
|W+APDZiRZIOjc/gmklDpL95WFxwkMRGthMXLnLDxbNZ6qZA== xxxxx.xxx.xxxx.corp                   | 
|"GET /xxxx/d5d/data/v10/notification_events/NotifcationEventCollection?                 |
|$format=json&$filter=%20%20%2%20%20StartDate%20eq%20datetime'2021-03-                   |
|24T00:15:05'%20and%20substringof('dude',SystemRoles)&$expand=MailLog&$skiptoken=3701%20 |
|HTTP/1.1" 200 "-b" 7273b 391ms "python-requests/2.25.1" soso80-emea.xxxx.corp 50001 - - |                                                                               
+----------------------------------------------------------------------------------------+

so I want to remove whatever immediately after ? as below:

+----------------------------------------------------------------------------------------+
|this part should be removed from weblog                                                 |                      
+----------------------------------------------------------------------------------------+
|xxxx-clientt=005                                                                        |
+----------------------------------------------------------------------------------------+
|$format=json&$filter=%20%20%2%20%20StartDate%20eq%20datetime'2021-03-                   |
|24T00:15:05'%20and%20substringof('dude',SystemRoles)&$expand=MailLog&$skiptoken=3701%20 |                                       
+----------------------------------------------------------------------------------------+

My expected output is like this:

+----------------------------------------------------------------------------------------+
|weblog                                                                                  |
+----------------------------------------------------------------------------------------+
|[03/Oct/2021:09:26:37 +0000]                                                            |
|SsAzIiWuV1Bw9CtthtxTtav8VdmP3N2jkJ/ZTsx6u8ATOC8HFwxKYmWwMrwl6t7heGKU7+Q==               |     
|user_ZwfikI/2BdNcrhkwWai/bh+zX66co70YwGKAigzuLTW4khCvc1LLmFN1aBH7K0Loq8g==              |
|"HEAD /xxxx/pub/ping? HTTP/1.1" 200 "-b" 53b 7ms                                        |
|"Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)"                                    |
|WepX20WkyvTydOpOuk/IDIVsxN+4zOZbRzng== 50000 - -                                        |
+----------------------------------------------------------------------------------------+
|[03/Oct/2021:00:19:24 +0000]                                                            |
|W+APDZiRZIOjc/gmklDpL95WFxwkMRGthMXLnLDxbNZ6qZA== xxxxx.xxx.xxxx.corp                   | 
|"GET /xxxx/d5d/data/v10/notification_events/NotifcationEventCollection?                 |
|HTTP/1.1" 200 "-b" 7273b 391ms "python-requests/2.25.1" soso80-emea.xxxx.corp 50001 - - |                                                                               
+----------------------------------------------------------------------------------------+

So I tried to find a quick and safe way inspired by this post but I couldn't adapt if you see the colab notebook at end of this question as I tried:

from urllib.parse import urlsplit, urlunsplit

def remove_query_params_and_fragment(url):
    return urlunsplit(urlsplit(url)._replace(query=""))

I tried following way unsuccessfully and sadly couldn't exclude desired part from the rest and clean it:

from pyspark.sql.functions import udf
from urllib.parse import urlsplit

schema2 = StructType(
    [
        StructField("path", StringType(), False),
        StructField("query", ArrayType(StringType(), False), True),
        StructField("fragment", StringType(), True),
    ]
)


def _parse_url(s):
    data = urlsplit(s)
    if data[3]:
        query_params = list()
        query_params.append(data[3])
    else:
        query_params = None
    return {
        "path": "{}://{}/{}".format(data[0], data[1].rstrip("/"), data[2]),
        "query": query_params,
        "fragment": data[4],
    }


url_parse_udf = f.udf(_parse_url, schema2)

parsed = sdf.select("*", url_parse_udf(sdf["weblog"]).alias("data"))

#+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
#|col                                                                                                                                                                                                                                                 #|
#+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
#|xxxx-client=005 HTTP/1.1" 200 "-b" 53b 7ms "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)" WepX20WkyvTydOpOuk/IDIVsxN+4zOZbRzng== 50000 - -                                                                                                    #|
#|$format=json&$filter=%20%20%2%20%20StartDate%20eq%20datetime"2021-03-24T00:15:05"%20and%20substringof("dude",SystemRoles)&$expand=MailLog&$skiptoken=3701%20 HTTP/1.1" 200 "-b" 7273b 391ms "python-requests/2.25.1" soso80-emea.xxxx.corp 50001 - -|
#+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

The problem with my attempt is it returns everything after ? in the raw weblog. I provided the colab notebook for quick debugging. I also was thinking if there is a mechanism to parse the weblog and extract URL parameters and then subtract two columns from each other like:

sdf1 = sdf.withColumn('Result', ( sdf['weblog'] - sdf['url_parameters'] ))
weblog url_parameters Results (weblog - url_parameters)
03/Oct/2021:09:26:37 +0000...xxxx-clientt=005... xxxx-clientt=005 ...
03/Oct/2021:00:19:24 +0000...$format=json&$... $format=json&$... ...
Mario
  • 1,631
  • 2
  • 21
  • 51

1 Answers1

0

Avoid using UDF if possible. UDF is like a black box to pyspark and thus spark cannot efficiently apply optimizations on them. For details please read this.

Rather than using Udfs, you can directly use pyspark's sql functions.


from pyspark.sql.functions import split
# from urllib.parse import urlsplit
split_with_question_mark = split(sdf.weblog, '\\?')
param_separated_df = sdf.withColumn("before_param", split_with_question_mark[0]).withColumn("after_param", split_with_question_mark[1])
param_separated_df.show(truncate=False)

Result:


+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|weblog                                                                                                                                                                                                                                                                                                                                                                                                                        |before_param                                                                                                                                                                                          |after_param                                                                                                                                                                                                                                         |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[03/Oct/2021:09:26:37 +0000] SsAzIiWuV1Bw9CtthtxTtav8VdmP3N2jkJ/ZTsx6u8ATOC8HFwxKYmWwMrwl6t7heGKU7+Q== user_ZwfikI/2BdNcrhkwWai/bh+zX66co70YwGKAigzuLTW4khCvc1LLmFN1aBH7K0Loq8g== "HEAD /xxxx/pub/ping?xxxx-client=005 HTTP/1.1" 200 "-b" 53b 7ms "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)" WepX20WkyvTydOpOuk/IDIVsxN+4zOZbRzng== 50000 - -                                                                       |[03/Oct/2021:09:26:37 +0000] SsAzIiWuV1Bw9CtthtxTtav8VdmP3N2jkJ/ZTsx6u8ATOC8HFwxKYmWwMrwl6t7heGKU7+Q== user_ZwfikI/2BdNcrhkwWai/bh+zX66co70YwGKAigzuLTW4khCvc1LLmFN1aBH7K0Loq8g== "HEAD /xxxx/pub/ping|xxxx-client=005 HTTP/1.1" 200 "-b" 53b 7ms "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)" WepX20WkyvTydOpOuk/IDIVsxN+4zOZbRzng== 50000 - -                                                                                                    |
|[03/Oct/2021:00:19:24 +0000] W+APDZiRZIOjc/gmklDpL95WFxwkMRGthMXLnLDxbNZ6qZA== xxxxx.xxx.xxxx.corp "GET /xxxx/d5d/data/v10/notification_events/NotifcationEventCollection?$format=json&$filter=%20%20%2%20%20StartDate%20eq%20datetime"2021-03-24T00:15:05"%20and%20substringof("dude",SystemRoles)&$expand=MailLog&$skiptoken=3701%20 HTTP/1.1" 200 "-b" 7273b 391ms "python-requests/2.25.1" soso80-emea.xxxx.corp 50001 - -|[03/Oct/2021:00:19:24 +0000] W+APDZiRZIOjc/gmklDpL95WFxwkMRGthMXLnLDxbNZ6qZA== xxxxx.xxx.xxxx.corp "GET /xxxx/d5d/data/v10/notification_events/NotifcationEventCollection                             |$format=json&$filter=%20%20%2%20%20StartDate%20eq%20datetime"2021-03-24T00:15:05"%20and%20substringof("dude",SystemRoles)&$expand=MailLog&$skiptoken=3701%20 HTTP/1.1" 200 "-b" 7273b 391ms "python-requests/2.25.1" soso80-emea.xxxx.corp 50001 - -|
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Once you have separated the before Query Url, You can split the after query part by Http method type i.e HTTP/1.1 to get the query parameters.

import pyspark.sql.functions as func

separated_by_comma = param_separated_df.withColumn("query_param", func.split(param_separated_df["after_param"], 'HTTP/1.1')[0]);
separated_by_comma.show(truncate=False)

Result:

+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------+
|weblog                                                                                                                                                                                                                                                                                                                                                                                                                        |before_param                                                                                                                                                                                          |after_param                                                                                                                                                                                                                                         |query_param                                                                                                                                                  |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[03/Oct/2021:09:26:37 +0000] SsAzIiWuV1Bw9CtthtxTtav8VdmP3N2jkJ/ZTsx6u8ATOC8HFwxKYmWwMrwl6t7heGKU7+Q== user_ZwfikI/2BdNcrhkwWai/bh+zX66co70YwGKAigzuLTW4khCvc1LLmFN1aBH7K0Loq8g== "HEAD /xxxx/pub/ping?xxxx-client=005 HTTP/1.1" 200 "-b" 53b 7ms "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)" WepX20WkyvTydOpOuk/IDIVsxN+4zOZbRzng== 50000 - -                                                                       |[03/Oct/2021:09:26:37 +0000] SsAzIiWuV1Bw9CtthtxTtav8VdmP3N2jkJ/ZTsx6u8ATOC8HFwxKYmWwMrwl6t7heGKU7+Q== user_ZwfikI/2BdNcrhkwWai/bh+zX66co70YwGKAigzuLTW4khCvc1LLmFN1aBH7K0Loq8g== "HEAD /xxxx/pub/ping|xxxx-client=005 HTTP/1.1" 200 "-b" 53b 7ms "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)" WepX20WkyvTydOpOuk/IDIVsxN+4zOZbRzng== 50000 - -                                                                                                    |xxxx-client=005                                                                                                                                              |
|[03/Oct/2021:00:19:24 +0000] W+APDZiRZIOjc/gmklDpL95WFxwkMRGthMXLnLDxbNZ6qZA== xxxxx.xxx.xxxx.corp "GET /xxxx/d5d/data/v10/notification_events/NotifcationEventCollection?$format=json&$filter=%20%20%2%20%20StartDate%20eq%20datetime"2021-03-24T00:15:05"%20and%20substringof("dude",SystemRoles)&$expand=MailLog&$skiptoken=3701%20 HTTP/1.1" 200 "-b" 7273b 391ms "python-requests/2.25.1" soso80-emea.xxxx.corp 50001 - -|[03/Oct/2021:00:19:24 +0000] W+APDZiRZIOjc/gmklDpL95WFxwkMRGthMXLnLDxbNZ6qZA== xxxxx.xxx.xxxx.corp "GET /xxxx/d5d/data/v10/notification_events/NotifcationEventCollection                             |$format=json&$filter=%20%20%2%20%20StartDate%20eq%20datetime"2021-03-24T00:15:05"%20and%20substringof("dude",SystemRoles)&$expand=MailLog&$skiptoken=3701%20 HTTP/1.1" 200 "-b" 7273b 391ms "python-requests/2.25.1" soso80-emea.xxxx.corp 50001 - -|$format=json&$filter=%20%20%2%20%20StartDate%20eq%20datetime"2021-03-24T00:15:05"%20and%20substringof("dude",SystemRoles)&$expand=MailLog&$skiptoken=3701%20 |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------+

All the above changes are made in collab that you shared.

Drashti Dobariya
  • 2,455
  • 2
  • 10
  • 23