Questions tagged [apache-spark-2.0]
Use for questions specific to Apache Spark 2.0. For general questions related to Apache Spark, use the tag [apache-spark].
464 questions
64
votes
4 answers
Reading csv files with quoted fields containing embedded commas
I am reading a csv file in PySpark as follows:
df_raw=spark.read.option("header","true").csv(csv_path)
However, the data file has quoted fields with embedded commas, which
should not be treated as delimiters. How can I handle this in PySpark?…
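A minimal sketch of the usual fix, assuming the file uses RFC-4180-style quoting (the quote/escape values and the path are assumptions, not taken from the question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-quotes").getOrCreate()
csv_path = "/path/to/data.csv"  # hypothetical path

df_raw = (spark.read
    .option("header", "true")
    .option("quote", '"')    # fields wrapped in this character may contain the delimiter
    .option("escape", '"')   # treat a doubled quote inside a field as a literal quote
    .csv(csv_path))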

femibyte
- 3,317
- 7
- 34
- 59
62
votes
5 answers
What are the various join types in Spark?
I looked at the docs, and they say the following join types are supported:
Type of join to perform. Default inner. Must be one of: inner, cross,
outer, full, full_outer, left, left_outer, right, right_outer,
left_semi, left_anti.
I looked at…
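For illustration, a minimal PySpark sketch exercising a few of the listed join types (the toy DataFrames are assumptions):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("join-types").getOrCreate()
left = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "l"])
right = spark.createDataFrame([(1, "x"), (3, "y")], ["id", "r"])

left.join(right, on="id", how="inner").show()      # keys present on both sides
left.join(right, on="id", how="left_outer").show() # every left row, nulls where unmatched
left.join(right, on="id", how="left_semi").show()  # left rows with a match, left columns only
left.join(right, on="id", how="left_anti").show()  # left rows with no match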

pathikrit
- 32,469
- 37
- 142
- 221
59
votes
5 answers
Spark parquet partitioning: Large number of files
I am trying to leverage Spark partitioning. I was trying to do something like
data.write.partitionBy("key").parquet("/location")
The issue here is that each partition creates a huge number of parquet files, which results in slow reads if I am trying to read from…
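A sketch of the commonly suggested mitigation: repartition by the partition column before the write, so all rows for a key land in one shuffle partition and each key=... directory gets one file instead of one file per task (column and path names follow the question; the toy data is an assumption):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitioned-write").getOrCreate()
data = spark.createDataFrame([(1, "a"), (1, "b"), (2, "c")], ["key", "value"])

# Hash-partitioning by "key" means a single task writes each key directory.
data.repartition("key").write.partitionBy("key").parquet("/location")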

Avishek Bhattacharya
- 6,534
- 3
- 34
- 53
40
votes
5 answers
How to create SparkSession from existing SparkContext
I have a Spark application which uses the new Spark 2.0 API with SparkSession.
I am building this application on top of another application which uses SparkContext. I would like to pass the SparkContext to my application and initialize the SparkSession…
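A sketch of one way to do this in PySpark, assuming sc stands in for the context handed over by the other application; builder.getOrCreate() reuses an already-running SparkContext rather than starting a new one:

from pyspark import SparkContext
from pyspark.sql import SparkSession

sc = SparkContext.getOrCreate()  # stands in for the context passed from the other application

spark = SparkSession.builder.config(conf=sc.getConf()).getOrCreate()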

Stefan Repcek
- 2,553
- 4
- 21
- 29
32
votes
3 answers
Spark 2.0 Dataset vs DataFrame
Starting out with Spark 2.0.1, I have some questions. I read a lot of documentation but so far could not find sufficient answers:
What is the difference between
df.select("foo")
df.select($"foo")
Do I understand correctly…
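On the select part: in the Scala API, $"foo" is shorthand for col("foo"). The PySpark analogues below all resolve to the same column (the toy DataFrame is an assumption):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("select-demo").getOrCreate()
df = spark.createDataFrame([(1, "a")], ["foo", "bar"])

# Three equivalent ways to reference a column in PySpark.
df.select("foo").show()
df.select(df["foo"]).show()
df.select(col("foo")).show()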

Georg Heiler
- 16,916
- 36
- 162
- 292
18
votes
2 answers
Spark off-heap memory config and Tungsten
I thought that with the integration of project Tungsten, Spark would automatically use off-heap memory.
What are spark.memory.offHeap.size and spark.memory.offHeap.enabled for? Do I need to manually specify the amount of off-heap memory for…
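A minimal sketch of enabling the two settings together (the 2g figure is an assumption; note the keys are spelled with camel-case offHeap):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("offheap-demo")
    .config("spark.memory.offHeap.enabled", "true")
    .config("spark.memory.offHeap.size", "2g")  # must be positive when off-heap is enabled
    .getOrCreate())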

Georg Heiler
- 16,916
- 36
- 162
- 292
17
votes
6 answers
Timeout exception in Apache Spark during program execution
I am running a Bash script on macOS. This script calls a Spark method written in Scala a large number of times. I am currently trying to call this method 100,000 times using a for loop.
The code exits with the following…

Yasir Arfat
- 645
- 1
- 8
- 21
17
votes
3 answers
Dynamically bind a variable/parameter in Spark SQL?
How can I bind a variable in Apache Spark SQL? For example:
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext.sql("SELECT * FROM src WHERE col1 = ${VAL1}").collect().foreach(println)

user3769729
- 171
- 1
- 1
- 4
16
votes
0 answers
Spark executors crash due to Netty memory leak
When running a Spark Streaming app that consumes data from a Kafka topic with 100 partitions, with 10 executors and 5 cores and 20 GB RAM per executor, the executors crash with the following log:
ERROR ResourceLeakDetector: LEAK:…

Elad Eldor
- 803
- 1
- 12
- 22
15
votes
5 answers
Spark fails to start in local mode when disconnected [possible bug in handling IPv6 in Spark?]
The problem is the same as described in "Error when starting spark-shell local on Mac"
... but I have failed to find a solution. I also used to get the "malformed URI" error, but now I get "expected hostname".
So when I am not connected to the internet,…
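A commonly suggested workaround, offered here as an assumption rather than a confirmed fix for this report: pin the driver to the loopback address so Spark never has to resolve the machine hostname (spark.driver.bindAddress requires Spark 2.1+):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .master("local[*]")
    .config("spark.driver.bindAddress", "127.0.0.1")
    .config("spark.driver.host", "127.0.0.1")
    .getOrCreate())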

Aliostad
- 80,612
- 21
- 160
- 208
13
votes
1 answer
Why does using cache on streaming Datasets fail with "AnalysisException: Queries with streaming sources must be executed with writeStream.start()"?
SparkSession
  .builder
  .master("local[*]")
  .config("spark.sql.warehouse.dir", "C:/tmp/spark")
  .config("spark.sql.streaming.checkpointLocation", "C:/tmp/spark/spark-checkpoint")
  .appName("my-test")
  .getOrCreate
  .readStream
  …
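A streaming Dataset can only be materialized through a started query, which is why eager actions such as cache() raise this AnalysisException. A minimal sketch of the supported path in PySpark (the socket source is an assumption for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("my-test").getOrCreate()

lines = (spark.readStream
    .format("socket")
    .option("host", "localhost")
    .option("port", 9999)
    .load())

# Streaming sources must be consumed via writeStream.start(), not cache()/count().
query = lines.writeStream.format("console").start()
query.awaitTermination()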

Martin Brisiak
- 3,872
- 12
- 37
- 51
12
votes
1 answer
Spark 2 can't write dataframe to parquet Hive table: `HiveFileFormat` doesn't match the specified format `ParquetFileFormat`
I'm trying to save a dataframe to a Hive table.
In Spark 1.6 it works, but after migrating to 2.2.0 it doesn't work anymore.
Here's the code:
blocs
  .toDF()
  .repartition($"col1", $"col2", $"col3", $"col4")
  .write
  …
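One commonly suggested workaround for this 2.2.0 error, given as an assumption rather than a verified fix for this exact table: insert into the existing Hive table with insertInto, which matches columns by position and sidesteps saveAsTable's format check (table and column names below are hypothetical):

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
df = spark.createDataFrame([(1, "a")], ["col1", "col2"])  # hypothetical data

df.write.mode("overwrite").insertInto("mydb.my_parquet_table")  # hypothetical table name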

youssef grati
- 121
- 1
- 1
- 5
12
votes
2 answers
Apache Spark vs Apache Spark 2
What improvements does Apache Spark 2 bring compared to Apache Spark 1.x?
From an architecture perspective
From an application point of view
or more

YoungHobbit
- 13,254
- 9
- 50
- 73
11
votes
2 answers
How to convert an RDD of dense vectors into a DataFrame in PySpark?
I have a DenseVector RDD like this:
>>> frequencyDenseVectors.collect()
[DenseVector([1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0]), DenseVector([1.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]), DenseVector([1.0,…
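A minimal sketch of one way to do the conversion: wrap each vector in a one-element tuple so each RDD record becomes a Row (the sample vectors are shortened stand-ins for the question's data):

from pyspark.sql import SparkSession
from pyspark.ml.linalg import DenseVector

spark = SparkSession.builder.appName("vectors-to-df").getOrCreate()
sc = spark.sparkContext

frequencyDenseVectors = sc.parallelize(
    [DenseVector([1.0, 0.0, 1.0]), DenseVector([1.0, 1.0, 1.0])])

df = frequencyDenseVectors.map(lambda v: (v,)).toDF(["features"])
df.show(truncate=False)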

Hardik Gupta
- 4,700
- 9
- 41
- 83
10
votes
1 answer
Pass system property to spark-submit and read file from classpath or custom path
I have recently found a way to use logback instead of log4j in Apache Spark (both for local use and spark-submit). However, there is one last piece missing.
The issue is that Spark tries very hard not to see logback.xml settings in its classpath. I have…
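A sketch of one common route, offered as an assumption: pass the logback location as a JVM system property via spark.executor.extraJavaOptions (the path is hypothetical). For the driver in client mode this must go on the spark-submit command line via --driver-java-options instead, since the driver JVM is already up by the time application code runs:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .config("spark.executor.extraJavaOptions",
            "-Dlogback.configurationFile=/path/to/logback.xml")  # hypothetical path
    .getOrCreate())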

Atais
- 10,857
- 6
- 71
- 111