I have log files whose entries look more or less like the examples below. I want to clean them up and restore each one to the real link it originally was.
Does anyone know how to write a regex in Python (PySpark) to get the desired output?
1:
https%3A%2F%2Fwww.btv.com%2Fnews%2Ffinland%2Fartikel%2F5174938%2Fzwemmer-zoekactie-julianadorp-kinderen-gered
Desired Output
https://www.btv.com/news/finland/artikel/5174938/zwemmer-zoekactie-julianadorp-kinderen-gered
2:
https%3A%2F%2Fwww.weather.com%2F
Desired Output
https://www.weather.com
3:
https%3A%2F%2Fwww.weather.com%2Ffinland%2Fneerslag%2Fweather%2F3uurs
Desired Output
https://www.weather.com/finland/neerslag/weather/3uurs
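To make the transformation concrete: in plain Python, outside Spark, what I am after seems to be roughly percent-decoding, as in the sketch below. The name clean_url is just something I made up for illustration, and stripping the trailing slash is my assumption based on example 2. What I still need is the PySpark equivalent.

```python
from urllib.parse import unquote


def clean_url(encoded: str) -> str:
    """Percent-decode a logged URL and drop any trailing slash.

    clean_url is a hypothetical helper name; removing the trailing
    slash is an assumption taken from example 2 above.
    """
    return unquote(encoded).rstrip("/")


# The three encoded inputs from the examples above.
samples = [
    "https%3A%2F%2Fwww.btv.com%2Fnews%2Ffinland%2Fartikel%2F5174938%2Fzwemmer-zoekactie-julianadorp-kinderen-gered",
    "https%3A%2F%2Fwww.weather.com%2F",
    "https%3A%2F%2Fwww.weather.com%2Ffinland%2Fneerslag%2Fweather%2F3uurs",
]

for s in samples:
    print(clean_url(s))
```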
I have tried a couple of solutions, but without really understanding them:
\b\w+\b(?!\/)
from pyspark.sql.functions import regexp_extract, col
regexp_extract(column_name, regex, group_number)
regex('(.)(by)(\s+)(\w+)')
Thanks in advance