I'm a newbie in PySpark and I want to exclude/remove the URL parameters from the raw weblog that existed in the spark dataframe. The nature of data is the following:
+----------------------------------------------------------------------------------------+
|weblog |
+----------------------------------------------------------------------------------------+
|[03/Oct/2021:09:26:37 +0000] |
|SsAzIiWuV1Bw9CtthtxTtav8VdmP3N2jkJ/ZTsx6u8ATOC8HFwxKYmWwMrwl6t7heGKU7+Q== |
|user_ZwfikI/2BdNcrhkwWai/bh+zX66co70YwGKAigzuLTW4khCvc1LLmFN1aBH7K0Loq8g== |
|"HEAD /xxxx/pub/ping?xxxx-client=005 HTTP/1.1" 200 "-b" 53b 7ms |
|"Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)" |
|WepX20WkyvTydOpOuk/IDIVsxN+4zOZbRzng== 50000 - - |
+----------------------------------------------------------------------------------------+
|[03/Oct/2021:00:19:24 +0000] |
|W+APDZiRZIOjc/gmklDpL95WFxwkMRGthMXLnLDxbNZ6qZA== xxxxx.xxx.xxxx.corp |
|"GET /xxxx/d5d/data/v10/notification_events/NotifcationEventCollection? |
|$format=json&$filter=%20%20%2%20%20StartDate%20eq%20datetime'2021-03- |
|24T00:15:05'%20and%20substringof('dude',SystemRoles)&$expand=MailLog&$skiptoken=3701%20 |
|HTTP/1.1" 200 "-b" 7273b 391ms "python-requests/2.25.1" soso80-emea.xxxx.corp 50001 - - |
+----------------------------------------------------------------------------------------+
so I want to remove whatever immediately after ?
as below:
+----------------------------------------------------------------------------------------+
|this part should be removed from weblog |
+----------------------------------------------------------------------------------------+
|xxxx-clientt=005 |
+----------------------------------------------------------------------------------------+
|$format=json&$filter=%20%20%2%20%20StartDate%20eq%20datetime'2021-03- |
|24T00:15:05'%20and%20substringof('dude',SystemRoles)&$expand=MailLog&$skiptoken=3701%20 |
+----------------------------------------------------------------------------------------+
My expected output is like this:
+----------------------------------------------------------------------------------------+
|weblog |
+----------------------------------------------------------------------------------------+
|[03/Oct/2021:09:26:37 +0000] |
|SsAzIiWuV1Bw9CtthtxTtav8VdmP3N2jkJ/ZTsx6u8ATOC8HFwxKYmWwMrwl6t7heGKU7+Q== |
|user_ZwfikI/2BdNcrhkwWai/bh+zX66co70YwGKAigzuLTW4khCvc1LLmFN1aBH7K0Loq8g== |
|"HEAD /xxxx/pub/ping? HTTP/1.1" 200 "-b" 53b 7ms |
|"Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)" |
|WepX20WkyvTydOpOuk/IDIVsxN+4zOZbRzng== 50000 - - |
+----------------------------------------------------------------------------------------+
|[03/Oct/2021:00:19:24 +0000] |
|W+APDZiRZIOjc/gmklDpL95WFxwkMRGthMXLnLDxbNZ6qZA== xxxxx.xxx.xxxx.corp |
|"GET /xxxx/d5d/data/v10/notification_events/NotifcationEventCollection? |
|HTTP/1.1" 200 "-b" 7273b 391ms "python-requests/2.25.1" soso80-emea.xxxx.corp 50001 - - |
+----------------------------------------------------------------------------------------+
So I tried to find a quick and safe way inspired by this post but I couldn't adapt if you see the colab notebook at end of this question as I tried:
from urllib.parse import urlsplit, urlunsplit
def remove_query_params_and_fragment(url):
return urlunsplit(urlsplit(url)._replace(query=""))
I tried following way unsuccessfully and sadly couldn't exclude desired part from the rest and clean it:
from pyspark.sql.functions import udf
from urllib.parse import urlsplit
schema2 = StructType(
[
StructField("path", StringType(), False),
StructField("query", ArrayType(StringType(), False), True),
StructField("fragment", StringType(), True),
]
)
def _parse_url(s):
data = urlsplit(s)
if data[3]:
query_params = list()
query_params.append(data[3])
else:
query_params = None
return {
"path": "{}://{}/{}".format(data[0], data[1].rstrip("/"), data[2]),
"query": query_params,
"fragment": data[4],
}
url_parse_udf = f.udf(_parse_url, schema2)
parsed = sdf.select("*", url_parse_udf(sdf["weblog"]).alias("data"))
#+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
#|col #|
#+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
#|xxxx-client=005 HTTP/1.1" 200 "-b" 53b 7ms "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)" WepX20WkyvTydOpOuk/IDIVsxN+4zOZbRzng== 50000 - - #|
#|$format=json&$filter=%20%20%2%20%20StartDate%20eq%20datetime"2021-03-24T00:15:05"%20and%20substringof("dude",SystemRoles)&$expand=MailLog&$skiptoken=3701%20 HTTP/1.1" 200 "-b" 7273b 391ms "python-requests/2.25.1" soso80-emea.xxxx.corp 50001 - -|
#+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
The problem with my attempt is it returns everything after ?
in the raw weblog. I provided the colab notebook for quick debugging. I also was thinking if there is a mechanism to parse the weblog and extract URL parameters and then subtract two columns from each other like:
sdf1 = sdf.withColumn('Result', ( sdf['weblog'] - sdf['url_parameters'] ))
weblog | url_parameters | Results (weblog - url_parameters) |
---|---|---|
03/Oct/2021:09:26:37 +0000...xxxx-clientt=005... | xxxx-clientt=005 | ... |
03/Oct/2021:00:19:24 +0000...$format=json&$... | $format=json&$... | ... |