Questions tagged [pyspark]

The Spark Python API (PySpark) exposes the Apache Spark programming model to Python.

39,058 questions
327 votes • 25 answers

How to change dataframe column names in PySpark?

I come from a pandas background and am used to reading data from CSV files into a dataframe, then simply changing the column names to something useful with the simple command: df.columns = new_column_name_list. However, the same doesn't work in…
Shubhanshu Mishra • 6,210 • 6 • 21 • 23
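
A minimal sketch of the usual fixes (the column names here are illustrative): pass the full list of new names to toDF, or rename columns one at a time with withColumnRenamed.

    new_column_name_list = ["id", "name", "score"]   # illustrative names
    df = df.toDF(*new_column_name_list)              # rename all columns at once
    # or rename a single column:
    df = df.withColumnRenamed("old_name", "new_name")
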
215 votes • 1 answer

Spark performance for Scala vs Python

I prefer Python over Scala. But since Spark is natively written in Scala, I was expecting my code to run faster in Scala than in Python, for obvious reasons. With that assumption, I thought I would learn & write the Scala version of some very…
Mrityunjay • 2,211 • 3 • 14 • 8
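
A sketch of where the gap usually comes from (the dataframe and its "text" column are assumed): built-in column expressions run inside the JVM and cost about the same from PySpark as from Scala, while Python UDFs push every row through Python worker processes.

    from pyspark.sql import functions as F
    from pyspark.sql.types import IntegerType

    # Runs in the JVM; roughly the same speed from Python or Scala:
    df = df.withColumn("len", F.length("text"))

    # Serializes rows to Python and back; this is where Python falls behind:
    py_len = F.udf(lambda s: len(s), IntegerType())
    df = df.withColumn("len2", py_len("text"))
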
205 votes • 3 answers

How to add a constant column in a Spark DataFrame?

I want to add a column to a DataFrame with some arbitrary value (the same for each row). I get an error when I use withColumn as follows: dt.withColumn('new_column',…
Evan Zamir • 8,059 • 14 • 56 • 83
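
A minimal sketch of the standard fix: withColumn expects a Column, so wrap the constant in lit() (the value 10 is illustrative).

    from pyspark.sql import functions as F

    dt = dt.withColumn("new_column", F.lit(10))
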
199 votes • 14 answers

Show distinct column values in pyspark dataframe

With a PySpark dataframe, how do you do the equivalent of pandas df['col'].unique()? I want to list all the unique values in a PySpark dataframe column, not the SQL-type way (registerTempTable, then a SQL query for the distinct values). Also I don't need…
Satya • 5,470 • 17 • 47 • 72
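
A minimal sketch, assuming the column is named "col": project the column, deduplicate with distinct(), and collect the values locally.

    values = [row["col"] for row in df.select("col").distinct().collect()]
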
185 votes • 16 answers

How to turn off INFO logging in Spark?

I installed Spark using the AWS EC2 guide and I can launch the program fine using the bin/pyspark script to get to the Spark prompt, and can also complete the Quick Start guide successfully. However, I cannot for the life of me figure out how to stop all…
horatio1701d • 8,809 • 14 • 48 • 77
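
A minimal sketch of the two usual approaches (the log level string is your choice):

    # At runtime, on the active SparkContext:
    sc.setLogLevel("WARN")   # or "ERROR" to silence warnings too

    # Persistently, on Spark builds that use log4j 1.x: copy
    # conf/log4j.properties.template to conf/log4j.properties and change
    # log4j.rootCategory=INFO, console  to  log4j.rootCategory=WARN, console
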
181 votes • 11 answers

How do I add a new column to a Spark DataFrame (using PySpark)?

I have a Spark DataFrame (using PySpark 1.5.1) and would like to add a new column. I've tried the following without any success:

    type(randomed_hours)  # => list
    # Create in Python and transform to RDD
    new_col = pd.DataFrame(randomed_hours,…
Boris • 2,005 • 2 • 11 • 10
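
One common answer, sketched with assumed column names: a column can't be built straight from a local Python list; derive it from existing columns or a literal instead.

    from pyspark.sql import functions as F

    df = df.withColumn("hours", F.col("seconds") / 3600)   # computed from data
    df = df.withColumn("flag", F.lit(1))                   # constant value
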
172 votes • 11 answers

Filter Pyspark dataframe column with None value

I'm trying to filter a PySpark dataframe that has None as a row value:

    df.select('dt_mvmt').distinct().collect()
    [Row(dt_mvmt=u'2016-03-27'), Row(dt_mvmt=u'2016-03-28'), Row(dt_mvmt=u'2016-03-29'), Row(dt_mvmt=None), …
Ivan • 19,560 • 31 • 97 • 141
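
A minimal sketch using the question's column name: NULL never matches an equality test, so use isNull()/isNotNull() instead.

    from pyspark.sql import functions as F

    df.filter(F.col("dt_mvmt").isNotNull())   # keep rows with a value
    df.filter(F.col("dt_mvmt").isNull())      # keep only the NULL rows
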
171 votes • 11 answers

Convert spark DataFrame column to python list

I work on a dataframe with two columns, mvv and count:

    +---+-----+
    |mvv|count|
    +---+-----+
    | 1 | 5 |
    | 2 | 9 |
    | 3 | 3 |
    | 4 | 1 |

I would like to obtain two lists containing the mvv values and the count values. Something like mvv = [1,2,3,4] count =…
a.moussa • 2,977 • 7 • 34 • 56
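
A minimal sketch using the question's column names: collect the rows once, then unpack them in Python.

    rows = df.select("mvv", "count").collect()
    mvv = [r["mvv"] for r in rows]
    count = [r["count"] for r in rows]
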
163 votes • 18 answers

How to check if spark dataframe is empty?

Right now, I have to use df.count > 0 to check if the DataFrame is empty or not. But it is kind of inefficient. Is there any better way to do that? PS: I want to check if it's empty so that I only save the DataFrame if it's not empty
auxdx • 2,313 • 3 • 21 • 25
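
A minimal sketch of the cheaper check: fetch at most one row instead of counting everything.

    is_empty = len(df.head(1)) == 0
    # Spark 3.3+ has this built in:
    # is_empty = df.isEmpty()
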
158 votes • 5 answers

How to find the size or shape of a DataFrame in PySpark?

I am trying to find out the size/shape of a DataFrame in PySpark. I do not see a single function that can do this. In pandas I can do data.shape. Is there a similar function in PySpark? This is my current solution, but I am looking for an…
Xi Liang • 1,649 • 3 • 10 • 5
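
A minimal sketch: the row count needs a Spark job, while the column count is local metadata.

    shape = (df.count(), len(df.columns))
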
156 votes • 7 answers

How to change a dataframe column from String type to Double type in PySpark?

I have a dataframe with a column of type String, and I want to change the column type to Double in PySpark. This is how I did it: toDoublefunc = UserDefinedFunction(lambda x: x, DoubleType()) changedTypedf =…
Abhishek Choudhary • 8,255 • 19 • 69 • 128
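
A minimal sketch (the column name is illustrative): cast() handles the conversion natively, so no UDF is needed.

    from pyspark.sql import functions as F

    df = df.withColumn("price", F.col("price").cast("double"))
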
152 votes • 9 answers

How to delete columns in pyspark dataframe

    >>> a
    DataFrame[id: bigint, julian_date: string, user_id: bigint]
    >>> b
    DataFrame[id: bigint, quan_created_money: decimal(10,0), quan_created_cnt: bigint]
    >>> a.join(b, a.id==b.id, 'outer')
    DataFrame[id: bigint, julian_date: string, user_id: bigint,…
xjx0524 • 1,531 • 2 • 10 • 5
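
A minimal sketch using the question's names: drop() removes columns, and joining on the column name rather than an expression avoids the duplicated id entirely.

    joined = a.join(b, "id", "outer")      # single id column in the result
    joined = joined.drop("julian_date")    # drop any column you don't want
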
145 votes • 12 answers

Spark Dataframe distinguish columns with duplicated name

As far as I know, in a Spark Dataframe multiple columns can have the same name, as shown in the dataframe snapshot below: [ Row(a=107831, f=SparseVector(5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0}), a=107831, f=SparseVector(5, {0: 0.0, 1: 0.0, 2: 0.0,…
resec • 2,091 • 3 • 13 • 22
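
A minimal sketch (the dataframe and column names follow the snapshot): alias each side of the join, then qualify the ambiguous names.

    from pyspark.sql import functions as F

    joined = a.alias("l").join(b.alias("r"), F.col("l.a") == F.col("r.a"))
    joined.select(F.col("l.f"), F.col("r.f"))   # unambiguous references
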
139 votes • 8 answers

Sort in descending order in PySpark

I'm using PySpark (Python 2.7.9/Spark 1.3.1) and have a dataframe GroupObject which I need to filter & sort in descending order. I'm trying to achieve it with this piece of code: group_by_dataframe.count().filter("`count` >= 10").sort('count',…
rclakmal • 1,872 • 3 • 17 • 19
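
A minimal sketch continuing the question's snippet: desc() (or Column.desc()) flips the ordering.

    from pyspark.sql import functions as F

    group_by_dataframe.count().filter("`count` >= 10").sort(F.desc("count"))
    # equivalently: .sort(F.col("count").desc())
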
136 votes • 5 answers

How to kill a running Spark application?

I have a running Spark application that occupies all the cores, so my other applications won't be allocated any resources. I did some quick research, and people suggested using YARN kill or /bin/spark-class to kill the command. However, I am…
B.Mr.W. • 18,910 • 35 • 114 • 178
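
On YARN, the usual route is yarn application -kill with the application ID shown in the ResourceManager UI. A minimal sketch wrapping that command from Python; the application ID here is hypothetical.

    import subprocess

    # Find real IDs with: yarn application -list
    app_id = "application_1500000000000_0001"   # hypothetical ID
    subprocess.run(["yarn", "application", "-kill", app_id], check=True)
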