
I have to extract the integers from the URL in the "Page URL" column and append them to a new column with PySpark.

Here is my code:

def url_val(raw_url):
    # Split off the path after '.com/' (for inspection only)
    params = raw_url.split('.com/')
    params = params[1].split('/')
    print(params)
    # Concatenate every digit character in the URL and parse the result as an int
    url_int = ''.join(x for x in raw_url if x.isdigit())
    return int(url_int)

url_val('https://www.crfashionbook.com/beauty/g28326016/crs-beauty-skincare-product-of-the-day')

The output is 28326016, which is exactly what I want. Now I need to apply this to every URL in the "Page URL" column and put the extracted integers into a new column. How would I do that? I have tried the following:

url_udf = udf(lambda x: url_val(x), IntegerType())
final_url_df = spark_df_url.filter(url_udf("Page URL"))

That raised a `Py4JJavaError`.

I have also tried:

(
    spark_df_url.select('Page URL',
              url_udf('Page URL').alias('new_column'))
    .show()
) 

That gave me an error as well.

Chique_Code
  • If it's just digits, use [`regexp_extract`](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.regexp_extract) instead of [using a `udf`](https://stackoverflow.com/questions/38296609/spark-functions-vs-udf-performance). `spark_df_url.withColumn("new_column", regexp_extract(col("Page URL"), "\d+", 1).cast("int")).show()` – pault Jan 22 '20 at 15:51
  • Thank you for your response, @pault. I get the following error: `TypeError: 'Column' object is not callable` – Chique_Code Jan 22 '20 at 15:55
  • Make sure you do the appropriate imports: `from pyspark.sql.functions import col, regexp_extract` – pault Jan 22 '20 at 15:57
  • I had them imported, but the error is still there :( – Chique_Code Jan 22 '20 at 16:37
  • @pault, may I ask why you closed my question and marked it as a duplicate? Nowhere on Stack Overflow have I found this question answered. Also, the proposed duplicate uses another language, which is not helpful at all. – Chique_Code Jan 22 '20 at 17:26
  • The answer is this: ```spark_df_url.withColumn("new_column", regexp_extract("Page URL", "\d+", 0)).show()``` – Chique_Code Jan 22 '20 at 18:45
  • Oops, I put in a 1 where it should have been a 0. Still a duplicate, though. Please don't ask the same question again. If you disagree with the duplicate closure, [edit] the question to explain WHY the duplicate doesn't work for you and include a [mcve]. – pault Jan 22 '20 at 19:15
  • I changed the dupe reason. Now it points to the answer. – Wiktor Stribiżew Jan 23 '20 at 08:15
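Putting the comment thread together, here is a minimal sketch of both fixes. The digit-joining logic is testable in plain Python; the Spark calls are shown as a sketch, assuming a DataFrame `spark_df_url` with a "Page URL" column as in the question (note that `regexp_extract` needs group index 0, since `r"\d+"` has no capture groups, and the `udf` belongs in `withColumn`/`select`, not `filter`):

```python
def url_val(raw_url):
    # Concatenate every digit character in the URL and parse the result as an int
    url_int = ''.join(ch for ch in raw_url if ch.isdigit())
    return int(url_int)

url = 'https://www.crfashionbook.com/beauty/g28326016/crs-beauty-skincare-product-of-the-day'
print(url_val(url))  # 28326016

# PySpark sketch (assumes spark_df_url exists with a "Page URL" column):
#
# from pyspark.sql.functions import col, regexp_extract, udf
# from pyspark.sql.types import IntegerType
#
# # Preferred: built-in regex; group 0 is the whole match of r"\d+"
# spark_df_url.withColumn(
#     "new_column",
#     regexp_extract(col("Page URL"), r"\d+", 0).cast("int"),
# ).show()
#
# # Alternative: wrap url_val as a udf and apply it column-wise
# url_udf = udf(url_val, IntegerType())
# spark_df_url.withColumn("new_column", url_udf(col("Page URL"))).show()
```

Note this only works cleanly when a URL contains a single run of digits; with several runs, `url_val` glues them together while `regexp_extract` returns only the first.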

0 Answers