
I have log files whose entries look more or less like the examples below. I want to clean them up and restore each entry to the real link it originally was.

Does anyone know how to write a regex in Python (PySpark) to get the desired output?

1: 

https%3A%2F%2Fwww.btv.com%2Fnews%2Ffinland%2Fartikel%2F5174938%2Fzwemmer-zoekactie-julianadorp-kinderen-gered

Desired Output 

https://www.btv.com/news/finland/artikel/5174938/zwemmer-zoekactie-julianadorp-kinderen-gered


2: 
https%3A%2F%2Fwww.weather.com%2F

Desired Output 
https://www.weather.com


3:
https%3A%2F%2Fwww.weather.com%2Ffinland%2Fneerslag%2Fweather%2F3uurs

Desired Output 
https://www.weather.com/finland/neerslag/weather/3uurs

I have tried a couple of solutions, but without really understanding them.

    \b\w+\b(?!\/)


    from pyspark.sql.functions import regexp_extract, col
    regexp_extract(column_name, regex, group_number)
    regex('(.)(by)(\s+)(\w+)')

Thanks in advance

James Taylor
  • Does this answer your question? [Url decode UTF-8 in Python](https://stackoverflow.com/questions/16566069/url-decode-utf-8-in-python) – Christian Baumann Oct 01 '20 at 09:39
  • @ChristianBaumann Somewhat helpful. However, I also came across something like `android-app%3A%2F%2Fcom.google.android.googlequicksearchbox%2Fhttps%2Fwww.google.com` – James Taylor Oct 01 '20 at 14:01

1 Answer


You don't need a regex for this: the strings are percent-encoded URLs, so you can decode them with `urllib.parse.unquote`. To use it in PySpark, wrap it in a UDF.

    from urllib.parse import unquote
    from pyspark.sql.types import StringType
    from pyspark.sql.functions import udf

    # Sample data: one percent-encoded URL per row
    # (assumes an active SparkSession named `spark`)
    df = spark.createDataFrame([['https%3A%2F%2Fwww.btv.com%2Fnews%2Ffinland%2Fartikel%2F5174938%2Fzwemmer-zoekactie-julianadorp-kinderen-gered'],
                                ['https%3A%2F%2Fwww.weather.com%2F'],
                                ['https%3A%2F%2Fwww.weather.com%2Ffinland%2Fneerslag%2Fweather%2F3uurs']], ['url'])

    # Wrap unquote in a UDF; nulls are passed through instead of raising a TypeError
    urldecode_udf = udf(lambda x: unquote(x) if x is not None else None, StringType())
    df = df.withColumn("decodedurl", urldecode_udf(df.url))
    df.select('decodedurl').show(3, False)

Output:

    +---------------------------------------------------------------------------------------------+
    |decodedurl                                                                                   |
    +---------------------------------------------------------------------------------------------+
    |https://www.btv.com/news/finland/artikel/5174938/zwemmer-zoekactie-julianadorp-kinderen-gered|
    |https://www.weather.com/                                                                     |
    |https://www.weather.com/finland/neerslag/weather/3uurs                                       |
    +---------------------------------------------------------------------------------------------+
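
To check the decoding outside Spark, here is a minimal plain-Python sketch. The helper name `decode_url` and its `strip_trailing_slash` flag are hypothetical, added only for illustration; the flag reflects the question's second desired output, which has no trailing slash. The last sample is the `android-app` string from the comments, which `unquote` should handle the same way, since it only translates `%XX` escapes and does not care about the scheme.

```python
from urllib.parse import unquote


def decode_url(value, strip_trailing_slash=False):
    """Percent-decode a logged URL; None is passed through unchanged.

    `decode_url` and `strip_trailing_slash` are illustrative names,
    not part of urllib or PySpark.
    """
    if value is None:
        return None
    decoded = unquote(value)
    if strip_trailing_slash:
        # Optional: drop trailing slashes, e.g. "https://www.weather.com/"
        decoded = decoded.rstrip("/")
    return decoded


samples = [
    "https%3A%2F%2Fwww.weather.com%2F",
    "https%3A%2F%2Fwww.weather.com%2Ffinland%2Fneerslag%2Fweather%2F3uurs",
    "android-app%3A%2F%2Fcom.google.android.googlequicksearchbox%2Fhttps%2Fwww.google.com",
]

for sample in samples:
    print(decode_url(sample, strip_trailing_slash=True))
```

The same function could be passed to `udf(decode_url, StringType())` in place of the lambda, assuming the default `strip_trailing_slash=False` is acceptable.
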
Equinox