0

I need to extract the HOST from millions of URLs. Some of the URLs are not formed well and return NULL. In many cases I see braces ({}) or pipes (|) causing issues, other times I see multiple hash (#) characters causing issues.

Here is my code containing URLs that I need to parse:

val b = Seq(
    ("https://example.com/test.aspx?doc={1A23B4C5-67D8-9012-E3F4-A5B67890CD12}"),
    ("https://example.com/test.aspx?names=John|Peter"),
    ("https://example.com/#/test.aspx?help=John#top"),
    ("https://example.com/test.aspx?doc=1A23B4C5-67D8-9012-E3F4-A5B67890CD12"),
    ).toDF("url_col")

b.createOrReplaceTempView("temp")
spark.sql("SELECT parse_url(`url_col`, 'HOST') as HOST, url_col from temp").show(false)

Expected output:

+-----------+------------------------------------------------------------------------+
|HOST       |url_col                                                                 |
+-----------+------------------------------------------------------------------------+
|example.com|https://example.com/test.aspx?doc={1A23B4C5-67D8-9012-E3F4-A5B67890CD12}|
|example.com|https://example.com/test.aspx?names=John|Peter                          |
|example.com|https://example.com/#/test.aspx?help=John#top                           |
|example.com|https://example.com/test.aspx?doc=1A23B4C5-67D8-9012-E3F4-A5B67890CD12  |
+-----------+------------------------------------------------------------------------+

Current output:

+-----------+------------------------------------------------------------------------+
|HOST       |url_col                                                                 |
+-----------+------------------------------------------------------------------------+
|null       |https://example.com/test.aspx?doc={1A23B4C5-67D8-9012-E3F4-A5B67890CD12}|
|null       |https://example.com/test.aspx?names=John|Peter                          |
|null       |https://example.com/#/test.aspx?help=John#top                           |
|example.com|https://example.com/test.aspx?doc=1A23B4C5-67D8-9012-E3F4-A5B67890CD12  |
+-----------+------------------------------------------------------------------------+

Is there a way to force parse_url to return the host when the URL contains invalid characters or is mal-formed? Or is there a better way?

blackbishop
  • 30,945
  • 11
  • 55
  • 76
Wonko the Sane
  • 754
  • 3
  • 14
  • 31

1 Answers1

1

You can extract the domain using regexp_extract function (example of regex):

spark.sql("""
    SELECT  regexp_extract(url_col, "^(?:https?:\/\/)?(?:[^@\n]+@)?(?:www.)?([^:\/\n?]+)", 1) as HOST, 
            url_col 
    FROM  temp
""").show(false)

//+-----------+------------------------------------------------------------------------+
//|HOST       |url_col                                                                 |
//+-----------+------------------------------------------------------------------------+
//|example.com|https://example.com/test.aspx?doc={1A23B4C5-67D8-9012-E3F4-A5B67890CD12}|
//|example.com|https://example.com/test.aspx?names=John|Peter                          |
//|example.com|https://example.com/#/test.aspx?help=John#top                           |
//|example.com|https://example.com/test.aspx?doc=1A23B4C5-67D8-9012-E3F4-A5B67890CD12  |
//+-----------+------------------------------------------------------------------------+
blackbishop
  • 30,945
  • 11
  • 55
  • 76