Questions tagged [pandas-udf]
41 questions
3
votes
1 answer
Pyspark Pandas-Vectorized UDFs
I am trying to convert this udf into this pandas udf, in order to avoid creating two pandas udfs.
Convert this:
@udf("string")
def splitEmailUDF(email: str, position: int) -> str:
    return email.split("@")[position]
into this in one pandas udf…
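One possible direction, as a sketch only: a single Series-to-Series function can take the position as a second column, so one pandas UDF serves every position. The name `split_email` and the choice to return `None` for missing or out-of-range parts are assumptions, not part of the question.

```python
import pandas as pd


def split_email(email: pd.Series, position: pd.Series) -> pd.Series:
    """Return the part of each email at the given position after splitting on '@'."""
    parts = email.str.split("@")
    picked = [
        # Null emails and out-of-range positions yield None (an assumption).
        p[i] if isinstance(p, list) and -len(p) <= i < len(p) else None
        for p, i in zip(parts, position)
    ]
    return pd.Series(picked, index=email.index)


# With Spark available, the function could be wrapped roughly like this:
#   from pyspark.sql.functions import pandas_udf, lit
#   split_email_udf = pandas_udf(split_email, "string")
#   df.select(split_email_udf("email", lit(0)), split_email_udf("email", lit(1)))
```

Passing the position as a literal column keeps the function vectorized over both arguments, which avoids defining one UDF per position.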

Susy84
3
votes
0 answers
Spark Apply In Pandas - How it works and how to tune
I have millions of sentences I want to encode with a model from sentence transformers (which is a pytorch model). https://www.sbert.net/
I am planning to use pyspark and an apply in pandas function.…

B_Miner
2
votes
1 answer
Use Pandas UDF to calculate Cosine Similarity of two vectors in PySpark
I want to calculate the cosine similarity of 2 vectors using Pandas UDF. I implemented it with Spark UDF, which works fine with the following script.
import numpy as np
from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType
#…
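A minimal sketch of the vectorized version, assuming both columns hold equal-length numeric arrays. The function name and the NaN result for zero-length vectors are assumptions; the original script is truncated, so this is not a conversion of it.

```python
import numpy as np
import pandas as pd


def cosine_similarity(a: pd.Series, b: pd.Series) -> pd.Series:
    """Row-wise cosine similarity between two columns of numeric arrays."""

    def one(u, v) -> float:
        u = np.asarray(u, dtype=float)
        v = np.asarray(v, dtype=float)
        denom = np.linalg.norm(u) * np.linalg.norm(v)
        # A zero vector has no direction; return NaN rather than divide by zero.
        return float(np.dot(u, v) / denom) if denom else float("nan")

    return pd.Series([one(u, v) for u, v in zip(a, b)], index=a.index)


# With Spark available, the function could be wrapped roughly like this:
#   from pyspark.sql.functions import pandas_udf
#   cosine_udf = pandas_udf(cosine_similarity, "double")
```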

Haritha Thilakarathne
2
votes
0 answers
How to use applyInPandas inside a class method with pyspark
I have a class with a native python function (performing some imputations on a pd df) that will be used on grouped data with applyInPandas (https://spark.apache.org/docs/3.1.2/api/python/reference/api/pyspark.sql.GroupedData.applyInPandas.html) in…
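One pattern that may help here, as a sketch: build the function from inside the class but let it close over plain local values only, so Spark does not have to serialize the whole instance. `GroupImputer`, `make_impute_fn`, and the mean imputation are hypothetical stand-ins for the class in the question.

```python
import pandas as pd


class GroupImputer:
    """Hypothetical wrapper that builds a per-group imputation function."""

    def __init__(self, column: str):
        self.column = column

    def make_impute_fn(self):
        column = self.column  # copy to a local so the closure does not capture self

        def impute(pdf: pd.DataFrame) -> pd.DataFrame:
            out = pdf.copy()
            # Fill missing values with the mean of this group.
            out[column] = out[column].fillna(out[column].mean())
            return out

        return impute


# With Spark available, usage could look roughly like this:
#   fn = GroupImputer("v").make_impute_fn()
#   df.groupBy("k").applyInPandas(fn, schema=df.schema)
```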

Feary
2
votes
1 answer
Pandas UDF with dictionary lookup and conditionals
I want to use pandas_udf in PySpark for certain transformations and calculations on a column, but it seems that a pandas UDF can't be written exactly like a normal UDF.
An example function looks something like below:
def…
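A pandas UDF receives whole columns as `pd.Series`, so a dictionary lookup and an `if`/`else` usually turn into `Series.map` and `Series.where`. The sketch below assumes a hypothetical `LOOKUP` table, a default factor of 1.0, and a rule that negative values become 0.0; none of these come from the question.

```python
import pandas as pd

# Hypothetical lookup table; the question's actual dictionary is not shown.
LOOKUP = {"a": 1.5, "b": 2.0}


def scale_by_category(category: pd.Series, value: pd.Series) -> pd.Series:
    # Dictionary lookup: unknown categories fall back to a factor of 1.0.
    factor = category.map(LOOKUP).fillna(1.0)
    scaled = value * factor
    # Conditional: keep the scaled value where the input is non-negative, else 0.0.
    return scaled.where(value >= 0, 0.0)


# With Spark available, the function could be wrapped roughly like this:
#   from pyspark.sql.functions import pandas_udf
#   scale_udf = pandas_udf(scale_by_category, "double")
```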

Tarique
2
votes
3 answers
Geopandas convert crs
I have created a geopandas dataframe with 50 million records containing latitude and longitude in CRS 3857, and I want to convert it to 4326. Because the dataset is huge, geopandas is unable to convert it. How can I execute this in a distributed manner?
…
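For point data, one option is to skip geopandas on the workers and apply the closed-form inverse of the spherical Web Mercator projection, which is vectorized and fits a pandas UDF. This is a sketch that swaps in the formula for a geopandas or pyproj call; the separate `x` and `y` input columns and the output column names are assumptions.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_M = 6378137.0  # sphere radius used by EPSG:3857


def mercator_to_wgs84(x: pd.Series, y: pd.Series) -> pd.DataFrame:
    """Convert EPSG:3857 metres to EPSG:4326 degrees for point coordinates."""
    lon = np.degrees(x / EARTH_RADIUS_M)
    lat = np.degrees(2.0 * np.arctan(np.exp(y / EARTH_RADIUS_M)) - np.pi / 2.0)
    return pd.DataFrame({"lon": lon, "lat": lat})


# With Spark available, the function could be wrapped roughly like this:
#   from pyspark.sql.functions import pandas_udf
#   to_wgs84 = pandas_udf(mercator_to_wgs84, "lon double, lat double")
```

Geometries other than points would still need a real projection library on each worker.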

code_bug
2
votes
2 answers
Apply wordninja.split() using pandas_udf
I have a dataframe df with the column sld of type string which includes some consecutive characters with no space/delimiter. One of the libraries that can be used to split is wordninja:
E.g. wordninja.split('culturetosuccess') outputs…

Elm662
2
votes
2 answers
Parallelize MLflow Project runs with Pandas UDF on Azure Databricks Spark
I'm trying to parallelize the training of multiple time-series using Spark on Azure Databricks.
Other than training, I would like to log metrics and models using MLflow.
The structure of the code is quite simple (basically adapted this example).
A…

Matteo Zantedeschi
1
vote
1 answer
Python UDF iterator -> iterator gives an "outputted more rows" error
I have a dataframe with a text column CALL_TRANSCRIPT (string) and a pii_allmethods column (array of strings). I am trying to search CALL_TRANSCRIPT for the strings in the array and mask them using a PySpark pandas UDF, but I get errors about outputting more rows than the input. Tried…
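An iterator-style pandas UDF has to yield exactly one output Series per input batch, with the same length as that batch. A sketch of masking under that constraint follows; the `"***"` mask token and the function name are assumptions, and the code the asker tried is not shown.

```python
import re
from typing import Iterator, Tuple

import pandas as pd


def mask_terms(batches: Iterator[Tuple[pd.Series, pd.Series]]) -> Iterator[pd.Series]:
    """Mask every listed term in each transcript, one output row per input row."""
    for transcripts, term_lists in batches:
        masked = []
        for text, terms in zip(transcripts, term_lists):
            if text is not None and terms is not None:
                for term in terms:
                    if term:
                        # re.escape treats the term as literal text, not a regex.
                        text = re.sub(re.escape(term), "***", text)
            masked.append(text)  # always append, so lengths stay equal
        yield pd.Series(masked, index=transcripts.index)


# With Spark available, the function could be wrapped roughly like this:
#   from pyspark.sql.functions import pandas_udf
#   mask_udf = pandas_udf(mask_terms, "string")
#   df.withColumn("masked", mask_udf("CALL_TRANSCRIPT", "pii_allmethods"))
```

Appending exactly once per row, including rows with nothing to mask, is what keeps the output length equal to the input length.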

Mohan Rayapuvari
1
vote
1 answer
Pyspark - Pandas UDF using Cosine Similarity - Setting an array element with a sequence
Here is my schema:
root
|-- embedding_init: array (nullable = true)
| |-- element: double (containsNull = true)
|-- embeddings: array (nullable = false)
| |-- element: array (containsNull = false)
| | |-- element: double…

madst
1
vote
1 answer
Azure Databricks: PythonException: 'RuntimeError: The length of output in Scalar iterator pandas UDF should be the same with the input's;
Env : Azure Databricks
Cluster : 11.3 LTS (includes Apache Spark 3.3.0, Scala 2.12)
I have a pandas_udf that works for 4 rows, but when I try it with more than 4 rows I get the error below.
PythonException: 'RuntimeError: The length of output in Scalar…

Ancil Pa
1
vote
1 answer
Error in pandas_udf with the vector expected 1, got 2
I'm trying to get the country name from latitude and longitude, so I used the Nominatim API. When I pass it as a UDF it works, but when I try to use pandas_udf I get the following error:
An exception was thrown from a UDF: 'RuntimeError:…

BryC
1
vote
2 answers
Using pandas udf without looping in pyspark
Suppose I have a big Spark dataframe and I don't know how many columns it has.
(The solution has to be in PySpark using a pandas UDF, not a different approach.)
I want to perform an action on all columns, so it's OK to loop inside over all columns.
But I don't…

Barushkish
1
vote
1 answer
Converting apply from pandas to a pandas_udf
How can I convert the following sample code to a pandas_udf:
def calculate_courses_final_df(this_row):
    ...  # some code that applies to each row of the data
df_contracts_courses.apply(lambda x: calculate_courses_final_df(x),…

Matt
1
vote
0 answers
How to use a @pandas_udf function inside a class with pyspark?
I am trying to use one of the Hugging Face models with MLflow. My input is a PySpark DataFrame.
The issue is that MLflow doesn't directly support Hugging Face models, so I need to use the pyfunc flavor to save it. So I need to create a Python class that…

Anna_v