I need to extract the HOST from millions of URLs. Some of the URLs are not formed well and return NULL. In many cases I see braces ({}
) or pipes (|
) causing issues, other times I see multiple hash (#
) characters causing issues.
Here is my code containing URLs that I need to parse:
val b = Seq(
("https://example.com/test.aspx?doc={1A23B4C5-67D8-9012-E3F4-A5B67890CD12}"),
("https://example.com/test.aspx?names=John|Peter"),
("https://example.com/#/test.aspx?help=John#top"),
("https://example.com/test.aspx?doc=1A23B4C5-67D8-9012-E3F4-A5B67890CD12"),
).toDF("url_col")
b.createOrReplaceTempView("temp")
spark.sql("SELECT parse_url(`url_col`, 'HOST') as HOST, url_col from temp").show(false)
Expected output:
+-----------+------------------------------------------------------------------------+
|HOST |url_col |
+-----------+------------------------------------------------------------------------+
|example.com|https://example.com/test.aspx?doc={1A23B4C5-67D8-9012-E3F4-A5B67890CD12}|
|example.com|https://example.com/test.aspx?names=John|Peter |
|example.com|https://example.com/#/test.aspx?help=John#top |
|example.com|https://example.com/test.aspx?doc=1A23B4C5-67D8-9012-E3F4-A5B67890CD12 |
+-----------+------------------------------------------------------------------------+
Current output:
+-----------+------------------------------------------------------------------------+
|HOST |url_col |
+-----------+------------------------------------------------------------------------+
|null |https://example.com/test.aspx?doc={1A23B4C5-67D8-9012-E3F4-A5B67890CD12}|
|null |https://example.com/test.aspx?names=John|Peter |
|null |https://example.com/#/test.aspx?help=John#top |
|example.com|https://example.com/test.aspx?doc=1A23B4C5-67D8-9012-E3F4-A5B67890CD12 |
+-----------+------------------------------------------------------------------------+
Is there a way to force parse_url to return the host when the URL contains invalid characters or is mal-formed? Or is there a better way?