I need to extract the integers from the URLs in the "Page URL" column and store them in a new column using PySpark.
Here is my code:
def url_val(raw_url):
    params = raw_url.split('.com/')
    params = params[1].split('/')
    print(params)
    print("first_scenario")
    url_int = ''.join(x for x in raw_url if x.isdigit())
    return int(url_int)
url_val('https://www.crfashionbook.com/beauty/g28326016/crs-beauty-skincare-product-of-the-day')
The output is 28326016, which is correct. Now I need to apply this to every URL in the "Page URL" column and put the extracted integers into a new column. How would I do that? I have tried the following:
url_udf = udf(lambda x: url_val(x), IntegerType())
final_url_df = spark_df_url.filter(url_udf("Page URL"))
That raised a Py4JJavaError.
I have also tried:
(
    spark_df_url
    .select('Page URL', url_udf('Page URL').alias('new_column'))
    .show()
)
That gave me an error as well.
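One thing I suspect but have not confirmed: `url_val` raises on any row where the URL is null or contains no digits (`int('')` fails), and an exception inside a UDF may be what surfaces as the Py4JJavaError. Below is a minimal sketch of a more defensive version in plain Python. It assumes the ID is the first run of digits after `.com/`, which differs from my original function (that one joins every digit in the URL). The name `url_val_safe` is hypothetical.

```python
import re
from typing import Optional

# Hypothetical helper, not part of my original code.
_DIGITS = re.compile(r"\d+")


def url_val_safe(raw_url: Optional[str]) -> Optional[int]:
    """Return the first run of digits after '.com/' as an int, or None."""
    if raw_url is None:
        return None
    # Keep only the part after the domain; if '.com/' is absent,
    # fall back to searching the whole string.
    _, sep, path = raw_url.partition(".com/")
    if not sep:
        path = raw_url
    match = _DIGITS.search(path)
    # Assumption: rows without any digits should yield None instead of raising.
    return int(match.group()) if match else None


print(url_val_safe(
    "https://www.crfashionbook.com/beauty/g28326016/"
    "crs-beauty-skincare-product-of-the-day"
))
```

If this is on the right track, the function could presumably be wrapped as `udf(url_val_safe, IntegerType())` and attached with `withColumn` rather than `filter`, since I want a new column and not a row filter.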