Questions tagged [pyspark-pandas]

131 questions
4
votes
1 answer

Pandas-on-spark throwing java.lang.StackOverFlowError

I am using pandas-on-spark in combination with regex to remove some abbreviations from a column in a dataframe. In pandas this all works fine, but I have the task to migrate this code to a production workload on our spark cluster, and therefore…
Psychotechnopath
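A likely cause, sketched below: each chained string-replace adds a node to Spark's lazy query plan, and a long enough chain can overflow the JVM stack when the plan is analysed. Folding the whole abbreviation map into one compiled pattern keeps it to a single pass. The map and the `text` column name are invented stand-ins:

```python
import re

# Invented abbreviation map, standing in for the real one.
ABBREVS = {r"\bdr\b": "doctor", r"\bst\b": "street"}

# One compiled alternation applies the whole map in a single pass instead of
# chaining one string-replace per abbreviation; a long chain of replaces can
# end in java.lang.StackOverflowError when Spark analyses the plan.
PATTERN = re.compile("|".join(ABBREVS), flags=re.IGNORECASE)

def expand(text: str) -> str:
    # Resolve each match back to the replacement whose pattern produced it.
    return PATTERN.sub(
        lambda m: next(r for p, r in ABBREVS.items()
                       if re.fullmatch(p, m.group(0), re.IGNORECASE)),
        text,
    )

# With pandas-on-Spark the whole map is then one plan node:
# psdf["text"] = psdf["text"].apply(expand)
```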
3
votes
2 answers

Update a specific value when 2 other values matches from 2 different tables in PySpark

Any idea how to write this in PySpark? I have two PySpark DataFrames that I'm trying to union. However, there is one value that I want to update based on two duplicate column values. PyDf1: +-----------+-----------+-----------+------------+ |test_date …
Mick
3
votes
4 answers

Create column using Spark pandas_udf, with dynamic number of input columns

I have this df: df = spark.createDataFrame( [('row_a', 5.0, 0.0, 11.0), ('row_b', 3394.0, 0.0, 4543.0), ('row_c', 136111.0, 0.0, 219255.0), ('row_d', 0.0, 0.0, 0.0), ('row_e', 0.0, 0.0, 0.0), ('row_f', 42.0, 0.0,…
3
votes
2 answers

How to filter pyspark dataframe with last 14 days?

I have a date column in my dataframe and want to filter out the last 14 days using that column. I tried the code below, but it's not working: last_14 = df.filter((df('Date')> date_add(current_timestamp(),…
sthambi
2
votes
1 answer

How do I run a function that applies regex iteratively in pandas-on-spark API?

I am using pandas-on-spark in combination with regex to remove some abbreviations from a column in a dataframe. In pandas this all works fine, but I have the task to migrate this code to a production workload on our spark cluster, and therefore…
Psychotechnopath
2
votes
1 answer

use applyInPandas with PySpark on a cluster

The applyInPandas method can be used to apply a function in parallel to a GroupedData pyspark object as in the minimal example below. import pandas as pd from time import sleep from pyspark.sql import SparkSession # spark session object spark =…
Russell Burdt
2
votes
1 answer

Pandas UDF with dictionary lookup and conditionals

I want to use pandas_udf in PySpark for certain transformations and calculations on a column. It seems that pandas UDFs can't be written exactly like normal UDFs. An example function looks something like below: def…
Tarique
2
votes
2 answers

How to add a column based on a function to Pandas on Spark DataFrame?

I would like to run a udf on a Pandas on Spark dataframe. I thought it would be easy, but I'm having a tough time figuring it out. For example, consider my psdf (Pandas Spark DataFrame) name p1 p2 0 AAA 1.0 1.0 1 BBB 1.0 …
Selva
2
votes
0 answers

Is there a library in Apache PySpark to convert HTML to PDF?

I'm trying to use a PySpark notebook in Microsoft Azure Synapse to convert an HTML string to a PDF. I have found multiple libraries such as "weasyprint", "wkhtmltopdf", "wkhtml2pdf", and "pdfkit" that work in Python but aren't available in…
Reece
2
votes
0 answers

ImportError: Pandas >= 0.23.2 must be installed; however, it was not found. / pyspark/pandas are not properly imported in Apache Spark 3.2.1

I have an Apache Spark 3.2.1 docker container running and the code below. Version 3.2.1 includes pandas, so I changed the import line to "from pyspark import pandas as ps", but I am still getting the error …
suj
2
votes
1 answer

Pie chart for pyspark.pandas.frame.DataFrame

How do I generate the same pie chart for pyspark.pandas.frame.DataFrame? I'm not able to get the legend right. piefreq=final_psdf['Target'].value_counts() piefreq.plot.pie() For pandas.core.frame.DataFrame, I managed to produce my desired pie chart…
gracenz
2
votes
1 answer

TypeError: Datetime subtraction can only be applied to datetime series

I am trying to replace pandas with the pyspark.pandas library. When I tried this (pdf is a pyspark.pandas dataframe): pdf["date_diff"] = pdf["date1"] - pdf["date2"] I got the error below: File…
user19930511
2
votes
2 answers

How to save empty pyspark dataframe with header into csv file?

Hi, I have a dataframe that has only columns and no data. When I try to save it to a file, no header is saved. The file is totally…
Shivika
1
vote
1 answer

get median of a columns based on the weights from another column

I have a data frame like this, col1 col2 100 3 200 2 300 4 400 1 Now I want to have median on col1 in such way col2 values will be the weights for each col1 values like this, median of [100, 100, 100, 200, 200, 300, 300,…
Kallol
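The weight column can be treated as a repeat count, which reduces the problem to an ordinary median. A stdlib sketch of that reduction:

```python
from statistics import median

def weighted_median(values, weights):
    # The weight is a repeat count: take the median of each value repeated
    # `weight` times.  Fine for small integer weights; for very large
    # weights, walk cumulative weights instead of materialising repeats.
    expanded = [v for v, w in zip(values, weights) for _ in range(w)]
    return median(expanded)

# For the question's data, col1 = [100, 200, 300, 400] with weights
# col2 = [3, 2, 4, 1] expands to
# [100, 100, 100, 200, 200, 300, 300, 300, 300, 400].
```

With pandas-on-Spark the same function could then be applied per group via `groupby(...).apply`, since only the two small columns reach each worker.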
1
vote
1 answer

Pandas to Pyspark conversion (repeat/explode)

I’m trying to take a notebook that I’ve written in Python/Pandas and modify/convert it to use Pyspark. The dataset I’m working with is (as real world datasets often are) complete and utter garbage, and so some of the things I have to do to it are…