Questions tagged [pandas-udf]

41 questions
3 votes • 1 answer

Pyspark Pandas-Vectorized UDFs

I am trying to convert this UDF into a pandas UDF, in order to avoid creating two pandas UDFs. Convert this: @udf("string") def splitEmailUDF(email: str, position: int) -> str: return email.split("@")[position] into one pandas UDF…
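A minimal sketch of the conversion being asked for, with the element-wise core kept as a plain pandas function so it can run without a cluster; the Spark registration is shown only as a comment:

```python
import pandas as pd

# Element-wise core: this is exactly what a scalar pandas_udf receives --
# one pandas Series per input column -- and it must return a Series of
# the same length.
def split_email(email: pd.Series, position: pd.Series) -> pd.Series:
    return pd.Series(
        [e.split("@")[p] for e, p in zip(email, position)],
        index=email.index,
    )

# Spark side (sketch, not run here):
# from pyspark.sql.functions import pandas_udf
# splitEmailUDF = pandas_udf(split_email, "string")
```

Because `position` arrives as a second Series, one pandas UDF covers both the "local part" and "domain" cases that would otherwise need two UDFs.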
3 votes • 0 answers

Spark Apply In Pandas - How it works and how to tune

I have millions of sentences I want to encode with a model from sentence-transformers (which is a PyTorch model). https://www.sbert.net/ I am planning to use PySpark and an applyInPandas function.…
B_Miner • 1,840
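A sketch of the applyInPandas shape the question describes, with a hypothetical stand-in encoder in place of the sentence-transformers model (the `sentence` column name and the bucketing key are assumptions):

```python
import pandas as pd

def encode_partition(pdf: pd.DataFrame) -> pd.DataFrame:
    # Stand-in for SentenceTransformer.encode (hypothetical): real code
    # would load the model once per worker and call model.encode on the
    # whole column, amortizing model load over the batch.
    pdf = pdf.copy()
    pdf["embedding"] = [[float(len(s))] for s in pdf["sentence"]]
    return pdf

# Spark side (sketch): assign a bucket column to control group size, then
# df.groupBy("bucket").applyInPandas(
#     encode_partition,
#     schema="sentence string, bucket int, embedding array<double>")
```

Group size is the main tuning knob here: each group becomes one pandas DataFrame in executor memory, so the bucketing column should be chosen to keep batches large enough to amortize model load but small enough to fit.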
2 votes • 1 answer

Use Pandas UDF to calculate Cosine Similarity of two vectors in PySpark

I want to calculate the cosine similarity of 2 vectors using Pandas UDF. I implemented it with Spark UDF, which works fine with the following script. import numpy as np from pyspark.sql.functions import udf from pyspark.sql.types import FloatType #…
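A sketch of the vectorized version: the whole batch is computed with numpy instead of row-at-a-time Python, assuming each column holds equal-length lists of doubles; the Spark registration is a comment:

```python
import numpy as np
import pandas as pd

def cosine_similarity(a: pd.Series, b: pd.Series) -> pd.Series:
    # Stack the per-row lists into 2-D arrays so the whole batch is
    # computed with vectorized numpy instead of a Python loop per row.
    A = np.vstack(a.to_numpy())
    B = np.vstack(b.to_numpy())
    num = (A * B).sum(axis=1)
    den = np.linalg.norm(A, axis=1) * np.linalg.norm(B, axis=1)
    return pd.Series(num / den, index=a.index)

# Spark side (sketch):
# from pyspark.sql.functions import pandas_udf
# cos_sim = pandas_udf(cosine_similarity, "double")
```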
2 votes • 0 answers

How to use applyInPandas inside a class method with pyspark

I have a class with a native python function (performing some imputations on a pd df) that will be used on grouped data with applyInPandas (https://spark.apache.org/docs/3.1.2/api/python/reference/api/pyspark.sql.GroupedData.applyInPandas.html) in…
Feary • 37
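One common workaround, sketched below: keep the grouped function free of `self`, so Spark only has to pickle a small closure over plain values instead of the whole instance (the class and column names here are hypothetical):

```python
import pandas as pd

class Imputer:
    """Holds configuration; the grouped work lives in a staticmethod so
    that Spark never tries to serialize the whole instance."""

    def __init__(self, value_col: str):
        self.value_col = value_col

    @staticmethod
    def _impute(pdf: pd.DataFrame, value_col: str) -> pd.DataFrame:
        # No reference to self: only the function and the captured
        # column name get shipped to executors.
        pdf = pdf.copy()
        pdf[value_col] = pdf[value_col].fillna(pdf[value_col].mean())
        return pdf

    def make_group_fn(self):
        col = self.value_col  # capture a plain string, not self
        return lambda pdf: Imputer._impute(pdf, col)

# Spark side (sketch):
# df.groupBy("id").applyInPandas(imputer.make_group_fn(), schema=df.schema)
```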
2 votes • 1 answer

Pandas UDF with dictionary lookup and conditionals

I want to use pandas_udf in PySpark for certain transformations and calculations on a column, and it seems that pandas UDFs can't be written exactly like normal UDFs. An example function looks something like this: def…
Tarique • 463
2 votes • 3 answers

Geopandas convert crs

I have created a geopandas dataframe with 50 million records which contain latitude/longitude in CRS 3857, and I want to convert to 4326. Since the dataset is huge, geopandas is unable to convert it. How can I execute this in a distributed manner? …
code_bug • 355
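In practice one would likely call `GeoDataFrame.to_crs(4326)` inside an `applyInPandas` function so each partition is reprojected on an executor; the math-only sketch below shows the spherical Mercator inverse (EPSG:3857 to EPSG:4326) itself, numpy-vectorized, as a dependency-free stand-in:

```python
import numpy as np
import pandas as pd

R = 6378137.0  # Web Mercator sphere radius in metres

def mercator_to_wgs84(x: pd.Series, y: pd.Series) -> pd.DataFrame:
    # Inverse spherical Mercator: metres -> degrees, fully vectorized.
    lon = np.degrees(x / R)
    lat = np.degrees(2.0 * np.arctan(np.exp(y / R)) - np.pi / 2.0)
    return pd.DataFrame({"lon": lon, "lat": lat})
```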
2 votes • 2 answers

Apply wordninja.split() using pandas_udf

I have a dataframe df with the column sld of type string, which includes some consecutive characters with no space/delimiter. One of the libraries that can be used to split them is wordninja: e.g. wordninja.split('culturetosuccess') outputs…
Elm662 • 663
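A sketch of the per-element pattern, written so that any `str -> list[str]` splitter can be injected; a stand-in lambda replaces `wordninja.split` here, since the library may not be installed everywhere:

```python
import pandas as pd

def make_splitter_udf(split_fn):
    # split_fn would be wordninja.split on the cluster; any callable
    # taking a string and returning a list of strings fits the pattern.
    def split_column(sld: pd.Series) -> pd.Series:
        return sld.apply(split_fn)
    return split_column

# Spark side (sketch):
# import wordninja
# from pyspark.sql.functions import pandas_udf
# split_udf = pandas_udf(make_splitter_udf(wordninja.split), "array<string>")
```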
2 votes • 2 answers

Parallelize MLflow Project runs with Pandas UDF on Azure Databricks Spark

I'm trying to parallelize the training of multiple time-series using Spark on Azure Databricks. Other than training, I would like to log metrics and models using MLflow. The structure of the code is quite simple (basically adapted this example). A…
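A sketch of the one-group-per-model pattern this kind of setup uses, with a trivial stand-in model; the MLflow calls are indicated only in comments, and the `series_id`/`y` column names are assumptions:

```python
import pandas as pd

def train_group(pdf: pd.DataFrame) -> pd.DataFrame:
    # One group = one time series. Fit a trivial stand-in "model" (the
    # mean); real code would fit the forecaster here and call
    # mlflow.log_metric / log_model inside this function, since it runs
    # once per group on an executor.
    key = pdf["series_id"].iloc[0]
    pred = pdf["y"].mean()
    return pd.DataFrame({"series_id": [key], "forecast": [pred]})

# Spark side (sketch):
# df.groupBy("series_id").applyInPandas(
#     train_group, schema="series_id string, forecast double")
```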
1 vote • 1 answer

Python UDF iterator -> iterator giving "outputted more rows" error

I have a dataframe with a text column CALL_TRANSCRIPT (string format) and a pii_allmethods column (array of strings). I am trying to search Call_Transcripts for the strings in the array and mask them using a PySpark pandas UDF, but am getting "outputted more than input rows" errors. Tried…
Mohan Rayapuvari • 289
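The iterator -> iterator contract is that the total output rows must equal the total input rows; a sketch that satisfies it by transforming each batch in place (digit masking below is a stand-in for the real PII logic):

```python
from typing import Iterator

import pandas as pd

def mask_batches(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
    # Iterator-of-Series pandas UDF shape: yield one transformed Series
    # per incoming batch, never adding or dropping rows -- that is what
    # triggers the "outputted more rows" error.
    for batch in batches:
        yield batch.str.replace(r"\d", "*", regex=True)

# Spark side (sketch):
# from pyspark.sql.functions import pandas_udf
# mask_udf = pandas_udf(mask_batches, "string")
```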
1 vote • 1 answer

Pyspark - Pandas UDF using Cosine Similarity - Setting an array element with a sequence

Here is my schema: root |-- embedding_init: array (nullable = true) | |-- element: double (containsNull = true) |-- embeddings: array (nullable = false) | |-- element: array (containsNull = false) | | |-- element: double…
madst • 93
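The "setting an array element with a sequence" error typically comes from handing numpy a ragged object array; a sketch that first stacks the list-valued column into a proper matrix, using the `embedding_init`/`embeddings` shapes from the question's schema:

```python
import numpy as np
import pandas as pd

def sims_to_init(init: list, embeddings: pd.Series) -> pd.Series:
    # embeddings arrives as an object Series of lists; np.vstack turns
    # it into a dense (n, d) float matrix, which avoids the
    # "setting an array element with a sequence" error.
    E = np.vstack(embeddings.to_numpy()).astype(float)
    v = np.asarray(init, dtype=float)
    num = E @ v
    den = np.linalg.norm(E, axis=1) * np.linalg.norm(v)
    return pd.Series(num / den, index=embeddings.index)
```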
1 vote • 1 answer

Azure Databricks: PythonException: 'RuntimeError: The length of output in Scalar iterator pandas UDF should be the same with the input's'

Env: Azure Databricks. Cluster: 11.3 LTS (includes Apache Spark 3.3.0, Scala 2.12). I have a pandas_udf; it works for 4 rows, but when I try with more than 4 rows I get the error below. PythonException: 'RuntimeError: The length of output in Scalar…
1 vote • 1 answer

Error in pandas_udf with the vector expected 1, got 2

I'm trying to get the country name with latitude and longitude as input, so I used the Nominatim API. When I use it as a UDF it works, but when I try to use pandas_udf I get the following error: An exception was thrown from a UDF: 'RuntimeError:…
BryC • 89
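A scalar pandas UDF must return exactly one scalar per input row; returning a pair (or an array) per row is a typical cause of "expected 1, got 2". A sketch of the correct shape, with a hypothetical stand-in for the geocoder call:

```python
import pandas as pd

def reverse_country(lat: float, lon: float) -> str:
    # Hypothetical stand-in: real code would query Nominatim here and
    # must return a single string, not a (lat, lon, country) tuple.
    return "somecountry"

def country_udf(lat: pd.Series, lon: pd.Series) -> pd.Series:
    return pd.Series(
        [reverse_country(a, b) for a, b in zip(lat, lon)],
        index=lat.index,
    )
```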
1 vote • 2 answers

Using pandas udf without looping in pyspark

So suppose I have a big Spark dataframe. I don't know how many columns it has. (The solution has to be in PySpark using a pandas UDF, not a different approach.) I want to perform an action on all columns, so it's OK to loop over the columns inside, but I don't…
Barushkish • 69
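With applyInPandas the column loop moves inside the pandas function, so the Spark plan does not depend on the column count; a sketch with a hypothetical min-max normalization as the per-column action:

```python
import pandas as pd

def normalize_all(pdf: pd.DataFrame) -> pd.DataFrame:
    # Works for any number of columns: iterate pdf.columns here, inside
    # one function call, instead of wiring one UDF call per column.
    out = pdf.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            rng = out[col].max() - out[col].min()
            out[col] = 0.0 if rng == 0 else (out[col] - out[col].min()) / rng
    return out

# Spark side (sketch):
# df.groupBy("partition_key").applyInPandas(normalize_all, schema=df.schema)
```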
1 vote • 1 answer

Converting apply from pandas to a pandas_udf

How can I convert the following sample code to a pandas_udf: def calculate_courses_final_df(this_row): some code that applies to each row of the data df_contracts_courses.apply(lambda x: calculate_courses_final_df(x),…
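A sketch of the usual translation: the row-wise `apply` becomes a Series-at-a-time function, which is exactly the shape a scalar `pandas_udf` expects (the `hours`/`rate` column names are hypothetical, since the original row logic is elided):

```python
import pandas as pd

# Row-wise original (slow, one Python call per row):
#   df.apply(lambda row: row["hours"] * row["rate"], axis=1)
# Series-at-a-time equivalent, one call per batch:
def course_cost(hours: pd.Series, rate: pd.Series) -> pd.Series:
    return hours * rate

# Spark side (sketch):
# from pyspark.sql.functions import pandas_udf
# cost_udf = pandas_udf(course_cost, "double")
# df_contracts_courses.withColumn("cost", cost_udf("hours", "rate"))
```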
1 vote • 0 answers

How to use a @pandas_udf function inside a class with pyspark?

I am trying to use one of the Hugging Face models with MLflow. My input is a PySpark DataFrame. The issue is that MLflow doesn't directly support Hugging Face models, so I need to use the pyfunc flavor to save it. So I need to create a Python class that…
Anna_v • 11