
I have to extract the integers from the URL in the "Page URL" column and append them to a new column with PySpark.

Here is my code:

def url_val(raw_url):
    # Split off the path after '.com/' (for inspection only)
    params = raw_url.split('.com/')
    params = params[1].split('/')
    print(params)
    # Concatenate every digit character in the URL and parse the result as an int
    url_int = ''.join(x for x in raw_url if x.isdigit())
    return int(url_int)

url_val('https://www.crfashionbook.com/beauty/g28326016/crs-beauty-skincare-product-of-the-day')

The output is 28326016, which is exactly what I want. Now I need to apply this to every URL in the "Page URL" column and put the extracted integers into a new column. How would I do that? I have tried the following:

url_udf = udf(lambda x: url_val(x), IntegerType())
final_url_df = spark_df_url.filter(url_udf("Page URL"))

That raised a `Py4JJavaError`.

I have also tried:

(
    spark_df_url.select('Page URL',
              url_udf('Page URL').alias('new_column'))
    .show()
) 

That gave me an error as well.

Chique_Code
  • If it's just digits, use [`regexp_extract`](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.regexp_extract) instead of [using a `udf`](https://stackoverflow.com/questions/38296609/spark-functions-vs-udf-performance). `spark_df_url.withColumn("new_column", regexp_extract(col("Page URL"), "\d+", 1).cast("int")).show()` – pault Jan 22 '20 at 15:51
  • Thank you for your response, @pault. I get the following error: `TypeError: 'Column' object is not callable` – Chique_Code Jan 22 '20 at 15:55
  • Make sure you do the appropriate imports: `from pyspark.sql.functions import col, regexp_extract` – pault Jan 22 '20 at 15:57
  • I had them imported, but the error is still there :( – Chique_Code Jan 22 '20 at 16:37
  • @pault, may I ask why you closed my question and marked it as a duplicate? Nowhere on Stack Overflow have I found this question answered. Also, the proposed duplicate uses another language, which is not helpful at all. – Chique_Code Jan 22 '20 at 17:26
  • The answer is this: ```spark_df_url.withColumn("new_column", regexp_extract("Page URL", "\d+", 0)).show()``` – Chique_Code Jan 22 '20 at 18:45
  • Oops, I put in a 1 where it should have been a 0. Still a duplicate, though. Please don't ask the same question again. If you disagree with the duplicate closure, [edit] the question to explain WHY the duplicate doesn't work for you and include a [mcve]. – pault Jan 22 '20 at 19:15
  • I changed the dupe reason. Now it points to the answer. – Wiktor Stribiżew Jan 23 '20 at 08:15
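Putting the comment thread together, here is a minimal sketch of both fixes. The digit-joining logic is testable in plain Python; the Spark calls are shown as a sketch, assuming a DataFrame `spark_df_url` with a "Page URL" column as in the question (note that `regexp_extract` needs group index 0, since `r"\d+"` has no capture groups, and the `udf` belongs in `withColumn`/`select`, not `filter`):

```python
def url_val(raw_url):
    # Concatenate every digit character in the URL and parse the result as an int
    url_int = ''.join(ch for ch in raw_url if ch.isdigit())
    return int(url_int)

url = 'https://www.crfashionbook.com/beauty/g28326016/crs-beauty-skincare-product-of-the-day'
print(url_val(url))  # 28326016

# PySpark sketch (assumes spark_df_url exists with a "Page URL" column):
#
# from pyspark.sql.functions import col, regexp_extract, udf
# from pyspark.sql.types import IntegerType
#
# # Preferred: built-in regex; group 0 is the whole match of r"\d+"
# spark_df_url.withColumn(
#     "new_column",
#     regexp_extract(col("Page URL"), r"\d+", 0).cast("int"),
# ).show()
#
# # Alternative: wrap url_val as a udf and apply it column-wise
# url_udf = udf(url_val, IntegerType())
# spark_df_url.withColumn("new_column", url_udf(col("Page URL"))).show()
```

Note this only works cleanly when a URL contains a single run of digits; with several runs, `url_val` glues them together while `regexp_extract` returns only the first.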

0 Answers