Questions tagged [apache-spark-sql]

Apache Spark SQL is a tool for "SQL and structured data processing" on Spark, a fast and general-purpose cluster computing system. It can be used to retrieve data from Hive, Parquet, etc., and to run SQL queries over existing RDDs and Datasets.

Apache Spark SQL is a tool that brings native support for SQL to Spark. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine.


26508 questions
340 votes · 14 answers

Difference between DataFrame, Dataset, and RDD in Spark

I'm just wondering what the difference is between an RDD and a DataFrame in Apache Spark (in Spark 2.0.0, DataFrame is a mere type alias for Dataset[Row]). Can you convert one to the other?
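On the conversion part of the question, a minimal PySpark sketch, assuming Spark 2.x+ with a SparkSession named spark (the sample data is purely illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# DataFrame -> RDD of Row objects
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "letter"])
rdd = df.rdd

# RDD of Rows -> DataFrame (schema inferred from the Row objects)
df2 = spark.createDataFrame(rdd)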
327 votes · 25 answers

How to change dataframe column names in PySpark?

I come from a pandas background and am used to reading data from CSV files into a dataframe, then changing the column names to something useful with the simple command df.columns = new_column_name_list. However, the same doesn't work in…
Shubhanshu Mishra · 6,210 rep
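A sketch of the usual PySpark answer (the column names here are hypothetical): toDF replaces every name at once, pandas-style, while withColumnRenamed renames a single column.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 2), (3, 4)], ["_c0", "_c1"])

# Replace all column names at once
new_column_name_list = ["id", "value"]
df = df.toDF(*new_column_name_list)

# Or rename a single column
df = df.withColumnRenamed("value", "count")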
205 votes · 3 answers

How to add a constant column in a Spark DataFrame?

I want to add a column to a DataFrame with some arbitrary value (the same for each row). I get an error when I use withColumn as follows: dt.withColumn('new_column',…
Evan Zamir · 8,059 rep
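The usual fix, sketched in PySpark with illustrative data: withColumn expects a Column, so a bare Python value must be wrapped in lit().

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
dt = spark.createDataFrame([(1,), (2,)], ["id"])

# Passing a raw value raises an error; lit() turns the constant into a Column
dt = dt.withColumn("new_column", F.lit(10))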
199 votes · 14 answers

Show distinct column values in pyspark dataframe

With a PySpark dataframe, how do you do the equivalent of pandas df['col'].unique()? I want to list all the unique values in a PySpark dataframe column, not the SQL-type way (registerTempTable, then a SQL query for distinct values). Also I don't need…
Satya · 5,470 rep
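A minimal sketch of the DataFrame-API route in PySpark, with no temp table or SQL string (the column name and data are assumptions for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a",), ("b",), ("a",)], ["col"])

# distinct() on the projected column, then unpack the collected Rows
unique_values = [row["col"] for row in df.select("col").distinct().collect()]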
194 votes · 10 answers

How to select the first row of each group?

I have a DataFrame generated as follows:
df.groupBy($"Hour", $"Category")
  .agg(sum($"value") as "TotalValue")
  .sort($"Hour".asc, $"TotalValue".desc)
The results look…
Rami · 8,044 rep
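One common answer is a window function. A PySpark sketch under the question's column names (Hour, Category, TotalValue); the sample rows are invented for illustration:

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
agg_df = spark.createDataFrame(
    [(0, "cat26", 30.9), (0, "cat13", 22.1), (1, "cat67", 28.5)],
    ["Hour", "Category", "TotalValue"])

# Rank rows within each Hour by descending TotalValue and keep the top row
w = Window.partitionBy("Hour").orderBy(F.col("TotalValue").desc())
top_per_hour = (agg_df
                .withColumn("rn", F.row_number().over(w))
                .filter(F.col("rn") == 1)
                .drop("rn"))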
182 votes · 18 answers

Concatenate columns in Apache Spark DataFrame

How do we concatenate two columns in an Apache Spark DataFrame? Is there any function in Spark SQL which we can use?
Nipun · 4,119 rep
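A sketch of the two usual options in PySpark (column names assumed): concat for plain concatenation, or concat_ws when you want a separator and null-skipping.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("foo", "bar")], ["a", "b"])

# Plain concatenation (returns null if any input is null)
df = df.withColumn("ab", F.concat(F.col("a"), F.col("b")))

# With a separator; concat_ws also skips nulls
df = df.withColumn("a_b", F.concat_ws("_", "a", "b"))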
181 votes · 11 answers

How do I add a new column to a Spark DataFrame (using PySpark)?

I have a Spark DataFrame (using PySpark 1.5.1) and would like to add a new column. I've tried the following without any success:
type(randomed_hours) # => list
# Create in Python and transform to RDD
new_col = pd.DataFrame(randomed_hours,…
Boris · 2,005 rep
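The usual explanation: a new column must be derived from the DataFrame itself (existing columns or a literal); a local Python list can't be attached directly. A hedged sketch with invented column names:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1,), (2,)], ["hours"])

# Derive the new column from an existing one, or from a literal
df = df.withColumn("hours_plus_one", F.col("hours") + 1)
df = df.withColumn("source", F.lit("random"))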
181 votes · 23 answers

How can I change column types in Spark SQL's DataFrame?

Suppose I'm doing something like:
val df = sqlContext.load("com.databricks.spark.csv", Map("path" -> "cars.csv", "header" -> "true"))
df.printSchema()
root
 |-- year: string (nullable = true)
 |-- make: string (nullable = true)
 |-- model: string…
kevinykuo · 4,600 rep
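The standard answer is cast. The question is in Scala, but the same idea in PySpark (column name taken from the question's schema) looks like:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("2012",), ("2013",)], ["year"])

# Replace the string column with a casted copy of itself
df = df.withColumn("year", F.col("year").cast("int"))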
178 votes · 6 answers

How to sort by column in descending order in Spark SQL?

I tried df.orderBy("col1").show(10), but it sorted in ascending order. df.sort("col1").show(10) also sorts in ascending order. I looked on Stack Overflow and the answers I found were all outdated or referred to RDDs. I'd like to use the native…
Vedom · 3,027 rep
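A PySpark sketch: orderBy takes Column expressions, and desc() flips the default ascending order (sample data assumed).

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(3,), (1,), (2,)], ["col1"])

df.orderBy(F.col("col1").desc()).show(10)
# Equivalent form using the functions helper:
df.orderBy(F.desc("col1")).show(10)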
172 votes · 11 answers

Filter Pyspark dataframe column with None value

I'm trying to filter a PySpark dataframe that has None as a row value:
df.select('dt_mvmt').distinct().collect()
[Row(dt_mvmt=u'2016-03-27'),
 Row(dt_mvmt=u'2016-03-28'),
 Row(dt_mvmt=u'2016-03-29'),
 Row(dt_mvmt=None),
…
Ivan · 19,560 rep
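The usual fix, sketched in PySpark: compare with isNull()/isNotNull() rather than == None, since a Column compared against None follows SQL null semantics and the filter matches nothing.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("2016-03-27",), (None,)], ["dt_mvmt"])

with_value = df.filter(F.col("dt_mvmt").isNotNull())
nulls_only = df.filter(F.col("dt_mvmt").isNull())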
171 votes · 11 answers

Convert spark DataFrame column to python list

I work on a dataframe with two columns, mvv and count:
+---+-----+
|mvv|count|
+---+-----+
| 1 | 5 |
| 2 | 9 |
| 3 | 3 |
| 4 | 1 |
I would like to obtain two lists containing the mvv values and the count values. Something like mvv = [1,2,3,4] count =…
a.moussa · 2,977 rep
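A minimal sketch: collect() brings the rows to the driver, and list comprehensions unpack them. Reasonable only when the result fits in driver memory; the data below mirrors the question's table.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 5), (2, 9), (3, 3), (4, 1)], ["mvv", "count"])

# One collect, then unpack both columns from the Row objects
rows = df.select("mvv", "count").collect()
mvv = [r["mvv"] for r in rows]
count = [r["count"] for r in rows]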
165 votes · 14 answers

Spark - load CSV file as DataFrame?

I would like to read a CSV in Spark, convert it to a DataFrame, and store it in HDFS with df.registerTempTable("table_name"). I have tried:
scala> val df = sqlContext.load("hdfs:///csv/file/dir/file.csv")
Error which I…
Donbeo · 17,067 rep
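On Spark 2.x+ the CSV reader is built in, so the separate spark-csv package is no longer needed. A hedged PySpark sketch (path copied from the question; createOrReplaceTempView is the modern replacement for registerTempTable):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("hdfs:///csv/file/dir/file.csv"))

df.createOrReplaceTempView("table_name")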
163 votes · 18 answers

How to check if spark dataframe is empty?

Right now, I have to use df.count > 0 to check if the DataFrame is empty or not, but that is rather inefficient. Is there any better way to do that? PS: I want to check emptiness so that I only save the DataFrame if it's not empty.
auxdx · 2,313 rep
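Two cheaper checks than a full count, sketched in PySpark: head(1) fetches at most one row and returns an empty list on an empty DataFrame, and rdd.isEmpty() is another common idiom (newer releases also ship df.isEmpty(), if your version has it).

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([], "id INT")  # empty DataFrame with an explicit schema

# head(1) avoids scanning the whole dataset the way count does
if not df.head(1):
    print("empty")

# Alternative idiom
print(df.rdd.isEmpty())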
156 votes · 7 answers

How to change a dataframe column from String type to Double type in PySpark?

I have a dataframe with a column of String type and wanted to change the column type to Double in PySpark. Following is the way I did it:
toDoublefunc = UserDefinedFunction(lambda x: x, DoubleType())
changedTypedf =…
Abhishek Choudhary · 8,255 rep
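The UDF is unnecessary; cast does this natively. A PySpark sketch with an assumed column name:

from pyspark.sql import SparkSession
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("1.5",), ("2.25",)], ["amount"])

# cast accepts a DataType instance or the shorthand string "double"
df = df.withColumn("amount", df["amount"].cast(DoubleType()))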
155 votes · 12 answers

How to convert rdd object to dataframe in spark

How can I convert an RDD (org.apache.spark.rdd.RDD[org.apache.spark.sql.Row]) to a DataFrame (org.apache.spark.sql.DataFrame)? I converted a dataframe to an RDD using .rdd; after processing it, I want it back as a dataframe. How can I do this?
user568109 · 47,225 rep
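The question is in Scala, but the same round trip in PySpark looks like the sketch below: createDataFrame rebuilds a DataFrame from an RDD of Rows, and reusing the original schema avoids a second inference pass (sample data and the filter step are illustrative).

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "letter"])

rdd = df.rdd                                      # DataFrame -> RDD[Row]
processed = rdd.filter(lambda row: row["id"] > 1) # arbitrary RDD processing

# RDD[Row] -> DataFrame, reusing the original schema
df2 = spark.createDataFrame(processed, df.schema)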