Questions tagged [pyspark-pandas]
131 questions
4
votes
1 answer
Pandas-on-Spark throwing java.lang.StackOverflowError
I am using pandas-on-spark in combination with regex to remove some abbreviations from a column in a dataframe. In pandas this all works fine, but I have the task to migrate this code to a production workload on our spark cluster, and therefore…

Psychotechnopath
- 2,471
- 5
- 26
- 47
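A frequent cause of this error is chaining many `.str.replace` calls, each of which grows the underlying Spark query plan; combining the abbreviations into one word-bounded regex keeps it to a single pass. A minimal sketch of that idea with plain pandas (pandas-on-Spark mirrors this API; the abbreviation map here is hypothetical, since the question's actual patterns aren't shown):

```python
import re

import pandas as pd

# Hypothetical abbreviation map -- the question's real patterns are elided.
abbrevs = {"dr": "doctor", "st": "street"}

# One combined, word-bounded pattern replaces a chain of .str.replace calls,
# which in pandas-on-Spark would each deepen the query plan.
pattern = re.compile(r"\b(" + "|".join(map(re.escape, abbrevs)) + r")\b")

def expand(text: str) -> str:
    # Look up the replacement for whichever abbreviation matched
    return pattern.sub(lambda m: abbrevs[m.group(1)], text)

s = pd.Series(["dr smith lives on main st"])
expanded = s.map(expand)
```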
3
votes
2 answers
Update a specific value when 2 other values match across 2 different tables in PySpark
Any idea how to write this in PySpark?
I have two PySpark DataFrames that I'm trying to union. However, there is one value that I want to update based on two duplicate column values.
PyDf1:
+-----------+-----------+-----------+------------+
|test_date …

Mick
- 265
- 2
- 10
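In PySpark this kind of update is usually a left join plus `F.coalesce` to prefer the new value where both keys match. The same logic sketched with a pandas merge (column names beyond the question's `test_date` are hypothetical):

```python
import pandas as pd

# Hypothetical tables: update df1's value wherever test_date and id both match df2
df1 = pd.DataFrame({"test_date": ["d1", "d2"], "id": [1, 2], "value": [10, 20]})
df2 = pd.DataFrame({"test_date": ["d2"], "id": [2], "value": [99]})

merged = df1.merge(df2, on=["test_date", "id"], how="left", suffixes=("", "_new"))
# Prefer the updated value where the two key columns matched
merged["value"] = merged["value_new"].fillna(merged["value"])
merged = merged.drop(columns="value_new")
```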
3
votes
4 answers
Create a column using a Spark pandas_udf with a dynamic number of input columns
I have this df:
df = spark.createDataFrame(
[('row_a', 5.0, 0.0, 11.0),
('row_b', 3394.0, 0.0, 4543.0),
('row_c', 136111.0, 0.0, 219255.0),
('row_d', 0.0, 0.0, 0.0),
('row_e', 0.0, 0.0, 0.0),
('row_f', 42.0, 0.0,…

ZygD
- 22,092
- 39
- 79
- 102
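One common workaround (not necessarily the accepted answer here) is to pack the variable columns into a single column with `F.struct(*cols)`; a `pandas_udf` then receives them as one pandas DataFrame, so the function body works for any column count. The body, prototyped with plain pandas (column names are hypothetical; values echo the question's df):

```python
import pandas as pd

def row_sum(pdf: pd.DataFrame) -> pd.Series:
    # Receives all packed columns at once, so the column count can vary
    return pdf.sum(axis=1)

pdf = pd.DataFrame({"c1": [5.0, 3394.0], "c2": [0.0, 0.0], "c3": [11.0, 4543.0]})
sums = row_sum(pdf)
```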
3
votes
2 answers
How to filter a PySpark dataframe to the last 14 days?
I have a date column in my dataframe.
I want to filter the dataframe down to the last 14 days using that column.
I tried the code below, but it's not working:
last_14 = df.filter((df('Date')> date_add(current_timestamp(),…

sthambi
- 197
- 2
- 17
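The quoted snippet likely fails because `df('Date')` is not valid column access (it should be `df["Date"]` or `F.col("Date")`) and the usual PySpark form of the cutoff is `F.date_sub(F.current_date(), 14)`. The equivalent filter in the pandas API, shown here with plain pandas (pandas-on-Spark mirrors it):

```python
import pandas as pd

now = pd.Timestamp.now()
# Two sample rows: one older than 14 days, one within the window
df = pd.DataFrame({"Date": [now - pd.Timedelta(days=30), now - pd.Timedelta(days=1)]})

cutoff = now - pd.Timedelta(days=14)
# Keep only rows from the last 14 days
last_14 = df[df["Date"] > cutoff]
```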
2
votes
1 answer
How do I run a function that applies regex iteratively in pandas-on-spark API?
I am using pandas-on-spark in combination with regex to remove some abbreviations from a column in a dataframe. In pandas this all works fine, but I have the task to migrate this code to a production workload on our spark cluster, and therefore…

Psychotechnopath
- 2,471
- 5
- 26
- 47
2
votes
1 answer
use applyInPandas with PySpark on a cluster
The applyInPandas method can be used to apply a function in parallel to a GroupedData pyspark object as in the minimal example below.
import pandas as pd
from time import sleep
from pyspark.sql import SparkSession
# spark session object
spark =…

Russell Burdt
- 2,391
- 2
- 19
- 30
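`applyInPandas` hands each group to the function as a plain pandas DataFrame, one group per task, so the function can be prototyped locally with a pandas groupby before running it on the cluster. A sketch with hypothetical column names:

```python
import pandas as pd

def per_group(pdf: pd.DataFrame) -> pd.DataFrame:
    # Same signature applyInPandas expects: pandas.DataFrame in, pandas.DataFrame out
    return pdf.assign(group_total=pdf["value"].sum())

df = pd.DataFrame({"key": ["a", "a", "b"], "value": [1, 2, 5]})
# Local stand-in for df.groupBy("key").applyInPandas(per_group, schema=...)
out = df.groupby("key", group_keys=False).apply(per_group)
```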
2
votes
1 answer
Pandas UDF with dictionary lookup and conditionals
I want to use pandas_udf in PySpark for certain transformations and calculations on columns. It seems that a pandas UDF can't be written exactly like a normal UDF.
An example function looks something like this:
def…

Tarique
- 463
- 3
- 16
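The usual adjustment is that a `pandas_udf` body receives a whole `pd.Series` rather than scalars, so per-row `if`/`else` and dictionary access become `Series.map` and `np.where`. A sketch of that pattern (the mapping and threshold are hypothetical):

```python
import numpy as np
import pandas as pd

lookup = {"a": 1.0, "b": 2.0}  # hypothetical lookup table

def categorize(s: pd.Series) -> pd.Series:
    # Vectorized dictionary lookup with a default for missing keys
    mapped = s.map(lookup).fillna(0.0)
    # Vectorized conditional instead of a per-row if/else
    return pd.Series(np.where(mapped > 1.5, "high", "low"))

labels = categorize(pd.Series(["a", "b", "c"]))
```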
2
votes
2 answers
How to add a column based on a function to Pandas on Spark DataFrame?
I would like to run a UDF on a Pandas on Spark dataframe. I thought it would be easy, but I'm having a tough time figuring it out.
For example, consider my psdf (Pandas Spark DataFrame)
name p1 p2
0 AAA 1.0 1.0
1 BBB 1.0 …

Selva
- 951
- 7
- 23
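For many cases a row-wise UDF isn't needed at all: pandas-on-Spark accepts plain column assignment with vectorized arithmetic, which is also much faster than calling a Python function per row. Shown with plain pandas (the same expression works on a `ps.DataFrame`; the derived column is hypothetical):

```python
import pandas as pd

psdf = pd.DataFrame({"name": ["AAA", "BBB"], "p1": [1.0, 1.0], "p2": [1.0, 2.0]})
# Vectorized column arithmetic instead of a per-row UDF
psdf["p_product"] = psdf["p1"] * psdf["p2"]
```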
2
votes
0 answers
Is there a library in Apache PySpark to convert HTML to PDF?
I'm trying to use a PySpark notebook in Microsoft Azure Synapse to convert an HTML string to a PDF. I have found multiple libraries such as "weasyprint", "wkhtmltopdf", "wkhtml2pdf", and "pdfkit" that work in Python but aren't available in…

Reece
- 21
- 1
2
votes
0 answers
ImportError: Pandas >= 0.23.2 must be installed; however, it was not found. / pyspark/pandas are not properly imported in Apache Spark 3.2.1
I have an Apache Spark 3.2.1 docker container running and got the code below. Version 3.2.1 includes the pandas API on Spark, so I changed the import line to "from pyspark import pandas as ps", but I am still getting the error
…

suj
- 507
- 1
- 8
- 22
2
votes
1 answer
Pie chart for pyspark.pandas.frame.DataFrame
How do I generate the same pie chart for pyspark.pandas.frame.DataFrame?
I'm not able to get the legend right.
piefreq=final_psdf['Target'].value_counts()
piefreq.plot.pie()
For pandas.core.frame.DataFrame, I managed to produce my desired pie chart…

gracenz
- 137
- 1
- 10
2
votes
1 answer
TypeError: Datetime subtraction can only be applied to datetime series
I am trying to replace pandas with the pyspark.pandas library. When I tried this
(pdf is a pyspark.pandas dataframe):
pdf["date_diff"] = pdf["date1"] - pdf["date2"]
I got the below error:
File…

user19930511
- 299
- 2
- 15
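This TypeError usually means at least one of the two columns is not datetime-typed; casting both with `to_datetime` first fixes it (pyspark.pandas exposes `ps.to_datetime` with the same shape as pandas, though the result type of the subtraction may differ between the two libraries). Sketched with plain pandas:

```python
import pandas as pd

pdf = pd.DataFrame({"date1": ["2024-01-15"], "date2": ["2024-01-01"]})
# Subtraction requires datetime dtype on both sides
pdf["date1"] = pd.to_datetime(pdf["date1"])
pdf["date2"] = pd.to_datetime(pdf["date2"])
pdf["date_diff"] = (pdf["date1"] - pdf["date2"]).dt.days
```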
2
votes
2 answers
How to save an empty PySpark dataframe with a header into a CSV file?
Hi, I have a dataframe that has only columns and no data for them. When I try to save it to a file, no header is saved. The file is totally…

Shivika
- 209
- 3
- 15
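Spark's CSV writer emits nothing at all for a zero-row DataFrame, regardless of the header option; one common workaround is to route the write through pandas (e.g. `df.toPandas().to_csv(...)`), since pandas keeps the header even with no rows. A minimal demonstration of that behavior:

```python
import io

import pandas as pd

empty = pd.DataFrame(columns=["id", "name"])
buf = io.StringIO()
# pandas writes the header line even when the frame has zero rows
empty.to_csv(buf, index=False)
```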
1
vote
1 answer
Get the median of a column based on weights from another column
I have a data frame like this,
col1 col2
100 3
200 2
300 4
400 1
Now I want the median of col1 where the col2 values act as weights for each col1 value, like this:
median of [100, 100, 100, 200, 200, 300, 300,…

Kallol
- 2,089
- 3
- 18
- 33
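With small integer weights, one direct approach is exactly what the question describes: repeat each value by its weight and take the ordinary median (`np.repeat` in NumPy; a PySpark analogue would be exploding an `array_repeat` column). A sketch using the question's data:

```python
import numpy as np

col1 = np.array([100, 200, 300, 400])
col2 = np.array([3, 2, 4, 1])  # integer weights

# Expand each value by its weight, then take the plain median:
# [100, 100, 100, 200, 200, 300, 300, 300, 300, 400]
weighted_median = np.median(np.repeat(col1, col2))
```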
1
vote
1 answer
Pandas to PySpark conversion (repeat/explode)
I'm trying to take a notebook that I've written in Python/Pandas and modify/convert it to use PySpark. The dataset I'm working with is (as real-world datasets often are) complete and utter garbage, and so some of the things I have to do to it are…

snakeeyes021
- 53
- 4
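The pandas repeat idiom referenced in the title has a direct PySpark analogue: `F.explode(F.array_repeat(col, n))` from `pyspark.sql.functions`. The pandas side, for reference (data is hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"item": ["x", "y"], "n": [2, 3]})
# Repeat each row n times; in PySpark: select F.explode(F.array_repeat(...))
out = df.loc[df.index.repeat(df["n"])].reset_index(drop=True)
```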