Questions tagged [pyspark-pandas]

131 questions
4
votes
1 answer

Pandas-on-spark throwing java.lang.StackOverFlowError

I am using pandas-on-spark in combination with regex to remove some abbreviations from a column in a dataframe. In pandas this all works fine, but I have the task to migrate this code to a production workload on our spark cluster, and therefore…
Psychotechnopath
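A likely cause, sketched below: each chained string-replace adds a node to Spark's lazy query plan, and a long enough chain can overflow the JVM stack when the plan is analysed. Folding the whole abbreviation map into one compiled pattern keeps it to a single pass. The map and the `text` column name are invented stand-ins:

```python
import re

# Invented abbreviation map, standing in for the real one.
ABBREVS = {r"\bdr\b": "doctor", r"\bst\b": "street"}

# One compiled alternation applies the whole map in a single pass instead of
# chaining one string-replace per abbreviation; a long chain of replaces can
# end in java.lang.StackOverflowError when Spark analyses the plan.
PATTERN = re.compile("|".join(ABBREVS), flags=re.IGNORECASE)

def expand(text: str) -> str:
    # Resolve each match back to the replacement whose pattern produced it.
    return PATTERN.sub(
        lambda m: next(r for p, r in ABBREVS.items()
                       if re.fullmatch(p, m.group(0), re.IGNORECASE)),
        text,
    )

# With pandas-on-Spark the whole map is then one plan node:
# psdf["text"] = psdf["text"].apply(expand)
```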
3
votes
2 answers

Update a specific value when 2 other values matches from 2 different tables in PySpark

Any idea how to write this in PySpark? I have two PySpark DataFrames that I'm trying to union. However, there is one value that I want to update based on two duplicate column values. PyDf1: +-----------+-----------+-----------+------------+ |test_date …
Mick
3
votes
4 answers

Create column using Spark pandas_udf, with dynamic number of input columns

I have this df: df = spark.createDataFrame( [('row_a', 5.0, 0.0, 11.0), ('row_b', 3394.0, 0.0, 4543.0), ('row_c', 136111.0, 0.0, 219255.0), ('row_d', 0.0, 0.0, 0.0), ('row_e', 0.0, 0.0, 0.0), ('row_f', 42.0, 0.0,…
3
votes
2 answers

How to filter pyspark dataframe with last 14 days?

I have a date column in my dataframe and want to filter out the last 14 days using that column. I tried the code below, but it's not working: last_14 = df.filter((df('Date')> date_add(current_timestamp(),…
sthambi
2
votes
1 answer

How do I run a function that applies regex iteratively in pandas-on-spark API?

I am using pandas-on-spark in combination with regex to remove some abbreviations from a column in a dataframe. In pandas this all works fine, but I have the task to migrate this code to a production workload on our spark cluster, and therefore…
Psychotechnopath
2
votes
1 answer

use applyInPandas with PySpark on a cluster

The applyInPandas method can be used to apply a function in parallel to a GroupedData pyspark object as in the minimal example below. import pandas as pd from time import sleep from pyspark.sql import SparkSession # spark session object spark =…
Russell Burdt
2
votes
1 answer

Pandas UDF with dictionary lookup and conditionals

I want to use pandas_udf in PySpark for certain transformations and calculations on a column. It seems that pandas UDFs can't be written exactly like normal UDFs. An example function looks something like below: def…
Tarique
2
votes
2 answers

How to add a column based on a function to Pandas on Spark DataFrame?

I would like to run a udf on a Pandas on Spark dataframe. I thought it would be easy, but I'm having a tough time figuring it out. For example, consider my psdf (Pandas Spark DataFrame) name p1 p2 0 AAA 1.0 1.0 1 BBB 1.0 …
Selva
2
votes
0 answers

Is there a library in Apache PySpark to convert HTML to PDF?

I'm trying to use a PySpark notebook in Microsoft Azure Synapse to convert an HTML string to a PDF. I have found multiple libraries such as "weasyprint", "wkhtmltopdf", "wkhtml2pdf", and "pdfkit" that work in Python but aren't available in…
Reece
2
votes
0 answers

ImportError: Pandas >= 0.23.2 must be installed; however, it was not found. / pyspark/pandas are not properly imported in Apache Spark 3.2.1

I have an Apache Spark 3.2.1 docker container running and the code below. Version 3.2.1 includes pandas, so I changed the import line to "from pyspark import pandas as ps", but I am still getting the error …
suj
2
votes
1 answer

Pie chart for pyspark.pandas.frame.DataFrame

How do I generate the same pie chart for pyspark.pandas.frame.DataFrame? I'm not able to get the legend right. piefreq=final_psdf['Target'].value_counts() piefreq.plot.pie() For pandas.core.frame.DataFrame, I managed to produce my desired pie chart…
gracenz
2
votes
1 answer

TypeError: Datetime subtraction can only be applied to datetime series

I am trying to replace pandas with the pyspark.pandas library. When I tried this (pdf is a pyspark.pandas dataframe): pdf["date_diff"] = pdf["date1"] - pdf["date2"] I got the error below: File…
user19930511
2
votes
2 answers

How to save empty pyspark dataframe with header into csv file?

Hi, I have a dataframe that has only columns and no data. When I try to save it to a file, no header is saved. The file is totally…
Shivika
1
vote
1 answer

get median of a columns based on the weights from another column

I have a data frame like this, col1 col2 100 3 200 2 300 4 400 1 Now I want to have median on col1 in such way col2 values will be the weights for each col1 values like this, median of [100, 100, 100, 200, 200, 300, 300,…
Kallol
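The weight column can be treated as a repeat count, which reduces the problem to an ordinary median. A stdlib sketch of that reduction:

```python
from statistics import median

def weighted_median(values, weights):
    # The weight is a repeat count: take the median of each value repeated
    # `weight` times.  Fine for small integer weights; for very large
    # weights, walk cumulative weights instead of materialising repeats.
    expanded = [v for v, w in zip(values, weights) for _ in range(w)]
    return median(expanded)

# For the question's data, col1 = [100, 200, 300, 400] with weights
# col2 = [3, 2, 4, 1] expands to
# [100, 100, 100, 200, 200, 300, 300, 300, 300, 400].
```

With pandas-on-Spark the same function could then be applied per group via `groupby(...).apply`, since only the two small columns reach each worker.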
1
vote
1 answer

Pandas to Pyspark conversion (repeat/explode)

I’m trying to take a notebook that I’ve written in Python/Pandas and modify/convert it to use Pyspark. The dataset I’m working with is (as real world datasets often are) complete and utter garbage, and so some of the things I have to do to it are…