Questions tagged [pandas-udf]

41 questions
3 votes • 1 answer

Pyspark Pandas-Vectorized UDFs

I am trying to convert this UDF into a pandas UDF, in order to avoid creating two pandas UDFs. Convert this: @udf("string") def splitEmailUDF(email: str, position: int) -> str: return email.split("@")[position] into one pandas UDF…
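A minimal sketch of the conversion being asked for, with the element-wise core kept as a plain pandas function so it can run without a cluster; the Spark registration is shown only as a comment:

```python
import pandas as pd

# Element-wise core: this is exactly what a scalar pandas_udf receives --
# one pandas Series per input column -- and it must return a Series of
# the same length.
def split_email(email: pd.Series, position: pd.Series) -> pd.Series:
    return pd.Series(
        [e.split("@")[p] for e, p in zip(email, position)],
        index=email.index,
    )

# Spark side (sketch, not run here):
# from pyspark.sql.functions import pandas_udf
# splitEmailUDF = pandas_udf(split_email, "string")
```

Because `position` arrives as a second Series, one pandas UDF covers both the "local part" and "domain" cases that would otherwise need two UDFs.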
3 votes • 0 answers

Spark Apply In Pandas - How it works and how to tune

I have millions of sentences I want to encode with a model from sentence-transformers (which is a PyTorch model). https://www.sbert.net/ I am planning to use PySpark and an applyInPandas function.…
B_Miner • 1,840
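A sketch of the applyInPandas shape the question describes, with a hypothetical stand-in encoder in place of the sentence-transformers model (the `sentence` column name and the bucketing key are assumptions):

```python
import pandas as pd

def encode_partition(pdf: pd.DataFrame) -> pd.DataFrame:
    # Stand-in for SentenceTransformer.encode (hypothetical): real code
    # would load the model once per worker and call model.encode on the
    # whole column, amortizing model load over the batch.
    pdf = pdf.copy()
    pdf["embedding"] = [[float(len(s))] for s in pdf["sentence"]]
    return pdf

# Spark side (sketch): assign a bucket column to control group size, then
# df.groupBy("bucket").applyInPandas(
#     encode_partition,
#     schema="sentence string, bucket int, embedding array<double>")
```

Group size is the main tuning knob here: each group becomes one pandas DataFrame in executor memory, so the bucketing column should be chosen to keep batches large enough to amortize model load but small enough to fit.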
2 votes • 1 answer

Use Pandas UDF to calculate Cosine Similarity of two vectors in PySpark

I want to calculate the cosine similarity of 2 vectors using Pandas UDF. I implemented it with Spark UDF, which works fine with the following script. import numpy as np from pyspark.sql.functions import udf from pyspark.sql.types import FloatType #…
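A sketch of the vectorized version: the whole batch is computed with numpy instead of row-at-a-time Python, assuming each column holds equal-length lists of doubles; the Spark registration is a comment:

```python
import numpy as np
import pandas as pd

def cosine_similarity(a: pd.Series, b: pd.Series) -> pd.Series:
    # Stack the per-row lists into 2-D arrays so the whole batch is
    # computed with vectorized numpy instead of a Python loop per row.
    A = np.vstack(a.to_numpy())
    B = np.vstack(b.to_numpy())
    num = (A * B).sum(axis=1)
    den = np.linalg.norm(A, axis=1) * np.linalg.norm(B, axis=1)
    return pd.Series(num / den, index=a.index)

# Spark side (sketch):
# from pyspark.sql.functions import pandas_udf
# cos_sim = pandas_udf(cosine_similarity, "double")
```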
2 votes • 0 answers

How to use applyInPandas inside a class method with pyspark

I have a class with a native python function (performing some imputations on a pd df) that will be used on grouped data with applyInPandas (https://spark.apache.org/docs/3.1.2/api/python/reference/api/pyspark.sql.GroupedData.applyInPandas.html) in…
Feary • 37
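One common workaround, sketched below: keep the grouped function free of `self`, so Spark only has to pickle a small closure over plain values instead of the whole instance (the class and column names here are hypothetical):

```python
import pandas as pd

class Imputer:
    """Holds configuration; the grouped work lives in a staticmethod so
    that Spark never tries to serialize the whole instance."""

    def __init__(self, value_col: str):
        self.value_col = value_col

    @staticmethod
    def _impute(pdf: pd.DataFrame, value_col: str) -> pd.DataFrame:
        # No reference to self: only the function and the captured
        # column name get shipped to executors.
        pdf = pdf.copy()
        pdf[value_col] = pdf[value_col].fillna(pdf[value_col].mean())
        return pdf

    def make_group_fn(self):
        col = self.value_col  # capture a plain string, not self
        return lambda pdf: Imputer._impute(pdf, col)

# Spark side (sketch):
# df.groupBy("id").applyInPandas(imputer.make_group_fn(), schema=df.schema)
```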
2 votes • 1 answer

Pandas UDF with dictionary lookup and conditionals

I want to use pandas_udf in PySpark for certain transformations and calculations on a column, and it seems that pandas UDFs can't be written exactly like normal UDFs. An example function looks something like this: def…
Tarique • 463
2 votes • 3 answers

Geopandas convert crs

I have created a geopandas dataframe with 50 million records which contain latitude/longitude in CRS 3857, and I want to convert to 4326. Since the dataset is huge, geopandas is unable to convert it. How can I execute this in a distributed manner? …
code_bug • 355
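In practice one would likely call `GeoDataFrame.to_crs(4326)` inside an `applyInPandas` function so each partition is reprojected on an executor; the math-only sketch below shows the spherical Mercator inverse (EPSG:3857 to EPSG:4326) itself, numpy-vectorized, as a dependency-free stand-in:

```python
import numpy as np
import pandas as pd

R = 6378137.0  # Web Mercator sphere radius in metres

def mercator_to_wgs84(x: pd.Series, y: pd.Series) -> pd.DataFrame:
    # Inverse spherical Mercator: metres -> degrees, fully vectorized.
    lon = np.degrees(x / R)
    lat = np.degrees(2.0 * np.arctan(np.exp(y / R)) - np.pi / 2.0)
    return pd.DataFrame({"lon": lon, "lat": lat})
```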
2 votes • 2 answers

Apply wordninja.split() using pandas_udf

I have a dataframe df with the column sld of type string, which includes some consecutive characters with no space/delimiter. One of the libraries that can be used to split them is wordninja: e.g. wordninja.split('culturetosuccess') outputs…
Elm662 • 663
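A sketch of the per-element pattern, written so that any `str -> list[str]` splitter can be injected; a stand-in lambda replaces `wordninja.split` here, since the library may not be installed everywhere:

```python
import pandas as pd

def make_splitter_udf(split_fn):
    # split_fn would be wordninja.split on the cluster; any callable
    # taking a string and returning a list of strings fits the pattern.
    def split_column(sld: pd.Series) -> pd.Series:
        return sld.apply(split_fn)
    return split_column

# Spark side (sketch):
# import wordninja
# from pyspark.sql.functions import pandas_udf
# split_udf = pandas_udf(make_splitter_udf(wordninja.split), "array<string>")
```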
2 votes • 2 answers

Parallelize MLflow Project runs with Pandas UDF on Azure Databricks Spark

I'm trying to parallelize the training of multiple time-series using Spark on Azure Databricks. Other than training, I would like to log metrics and models using MLflow. The structure of the code is quite simple (basically adapted this example). A…
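A sketch of the one-group-per-model pattern this kind of setup uses, with a trivial stand-in model; the MLflow calls are indicated only in comments, and the `series_id`/`y` column names are assumptions:

```python
import pandas as pd

def train_group(pdf: pd.DataFrame) -> pd.DataFrame:
    # One group = one time series. Fit a trivial stand-in "model" (the
    # mean); real code would fit the forecaster here and call
    # mlflow.log_metric / log_model inside this function, since it runs
    # once per group on an executor.
    key = pdf["series_id"].iloc[0]
    pred = pdf["y"].mean()
    return pd.DataFrame({"series_id": [key], "forecast": [pred]})

# Spark side (sketch):
# df.groupBy("series_id").applyInPandas(
#     train_group, schema="series_id string, forecast double")
```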
1 vote • 1 answer

Python UDF iterator -> iterator giving "outputted more rows" error

I have a dataframe with a text column CALL_TRANSCRIPT (string format) and a pii_allmethods column (array of strings). I am trying to search Call_Transcripts for the strings in the array and mask them using a PySpark pandas UDF, but am getting "outputted more than input rows" errors. Tried…
Mohan Rayapuvari • 289
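The iterator -> iterator contract is that the total output rows must equal the total input rows; a sketch that satisfies it by transforming each batch in place (digit masking below is a stand-in for the real PII logic):

```python
from typing import Iterator

import pandas as pd

def mask_batches(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
    # Iterator-of-Series pandas UDF shape: yield one transformed Series
    # per incoming batch, never adding or dropping rows -- that is what
    # triggers the "outputted more rows" error.
    for batch in batches:
        yield batch.str.replace(r"\d", "*", regex=True)

# Spark side (sketch):
# from pyspark.sql.functions import pandas_udf
# mask_udf = pandas_udf(mask_batches, "string")
```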
1 vote • 1 answer

Pyspark - Pandas UDF using Cosine Similarity - Setting an array element with a sequence

Here is my schema: root |-- embedding_init: array (nullable = true) | |-- element: double (containsNull = true) |-- embeddings: array (nullable = false) | |-- element: array (containsNull = false) | | |-- element: double…
madst • 93
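The "setting an array element with a sequence" error typically comes from handing numpy a ragged object array; a sketch that first stacks the list-valued column into a proper matrix, using the `embedding_init`/`embeddings` shapes from the question's schema:

```python
import numpy as np
import pandas as pd

def sims_to_init(init: list, embeddings: pd.Series) -> pd.Series:
    # embeddings arrives as an object Series of lists; np.vstack turns
    # it into a dense (n, d) float matrix, which avoids the
    # "setting an array element with a sequence" error.
    E = np.vstack(embeddings.to_numpy()).astype(float)
    v = np.asarray(init, dtype=float)
    num = E @ v
    den = np.linalg.norm(E, axis=1) * np.linalg.norm(v)
    return pd.Series(num / den, index=embeddings.index)
```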
1 vote • 1 answer

Azure Databricks: PythonException: 'RuntimeError: The length of output in Scalar iterator pandas UDF should be the same with the input's'

Env: Azure Databricks. Cluster: 11.3 LTS (includes Apache Spark 3.3.0, Scala 2.12). I have a pandas_udf; it works for 4 rows, but when I try with more than 4 rows I get the error below. PythonException: 'RuntimeError: The length of output in Scalar…
1 vote • 1 answer

Error in pandas_udf with the vector expected 1, got 2

I'm trying to get the country name with latitude and longitude as input, so I used the Nominatim API. When I use it as a UDF it works, but when I try to use pandas_udf I get the following error: An exception was thrown from a UDF: 'RuntimeError:…
BryC • 89
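A scalar pandas UDF must return exactly one scalar per input row; returning a pair (or an array) per row is a typical cause of "expected 1, got 2". A sketch of the correct shape, with a hypothetical stand-in for the geocoder call:

```python
import pandas as pd

def reverse_country(lat: float, lon: float) -> str:
    # Hypothetical stand-in: real code would query Nominatim here and
    # must return a single string, not a (lat, lon, country) tuple.
    return "somecountry"

def country_udf(lat: pd.Series, lon: pd.Series) -> pd.Series:
    return pd.Series(
        [reverse_country(a, b) for a, b in zip(lat, lon)],
        index=lat.index,
    )
```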
1 vote • 2 answers

Using pandas udf without looping in pyspark

So suppose I have a big Spark dataframe. I don't know how many columns it has. (The solution has to be in PySpark using a pandas UDF, not a different approach.) I want to perform an action on all columns, so it's OK to loop over the columns inside, but I don't…
Barushkish • 69
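With applyInPandas the column loop moves inside the pandas function, so the Spark plan does not depend on the column count; a sketch with a hypothetical min-max normalization as the per-column action:

```python
import pandas as pd

def normalize_all(pdf: pd.DataFrame) -> pd.DataFrame:
    # Works for any number of columns: iterate pdf.columns here, inside
    # one function call, instead of wiring one UDF call per column.
    out = pdf.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            rng = out[col].max() - out[col].min()
            out[col] = 0.0 if rng == 0 else (out[col] - out[col].min()) / rng
    return out

# Spark side (sketch):
# df.groupBy("partition_key").applyInPandas(normalize_all, schema=df.schema)
```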
1 vote • 1 answer

Converting apply from pandas to a pandas_udf

How can I convert the following sample code to a pandas_udf: def calculate_courses_final_df(this_row): some code that applies to each row of the data df_contracts_courses.apply(lambda x: calculate_courses_final_df(x),…
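A sketch of the usual translation: the row-wise `apply` becomes a Series-at-a-time function, which is exactly the shape a scalar `pandas_udf` expects (the `hours`/`rate` column names are hypothetical, since the original row logic is elided):

```python
import pandas as pd

# Row-wise original (slow, one Python call per row):
#   df.apply(lambda row: row["hours"] * row["rate"], axis=1)
# Series-at-a-time equivalent, one call per batch:
def course_cost(hours: pd.Series, rate: pd.Series) -> pd.Series:
    return hours * rate

# Spark side (sketch):
# from pyspark.sql.functions import pandas_udf
# cost_udf = pandas_udf(course_cost, "double")
# df_contracts_courses.withColumn("cost", cost_udf("hours", "rate"))
```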
1 vote • 0 answers

How to use a @pandas_udf function inside a class with pyspark?

I am trying to use one of the Hugging Face models with MLflow. My input is a PySpark DataFrame. The issue is that MLflow doesn't directly support Hugging Face models, so I need to use the pyfunc flavor to save it. So I need to create a Python class that…
Anna_v • 11