Questions tagged [apache-spark-2.0]

Use for questions specific to Apache Spark 2.0. For general questions related to Apache Spark, use the tag [apache-spark].

464 questions
64
votes
4 answers

Reading csv files with quoted fields containing embedded commas

I am reading a csv file in PySpark as follows: df_raw = spark.read.option("header", "true").csv(csv_path). However, the data file has quoted fields with embedded commas in them, which should not be treated as delimiters. How can I handle this in PySpark?…
femibyte
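
A hedged sketch of the standard fix (in Scala; the option names are identical in PySpark): configure the reader's quote and escape options so commas inside quoted fields are kept as data. csv_path is the asker's placeholder.

    // Sketch, assuming fields are wrapped in standard double quotes.
    val df_raw = spark.read
      .option("header", "true")
      .option("quote", "\"")   // the field-quoting character (also the default)
      .option("escape", "\"")  // treat "" inside a quoted field as a literal quote
      .csv(csv_path)           // csv_path: placeholder from the question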
62
votes
5 answers

What are the various join types in Spark?

I looked at the docs, and they say the following join types are supported: Type of join to perform. Default inner. Must be one of: inner, cross, outer, full, full_outer, left, left_outer, right, right_outer, left_semi, left_anti. I looked at…
pathikrit
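
A sketch of how those type strings are passed to Dataset.join, with hypothetical DataFrames left and right sharing an "id" column:

    val inner     = left.join(right, Seq("id"))                // default join type: inner
    val leftOuter = left.join(right, Seq("id"), "left_outer")  // keep every row of `left`
    val semi      = left.join(right, Seq("id"), "left_semi")   // rows of `left` with a match; no columns from `right`
    val anti      = left.join(right, Seq("id"), "left_anti")   // rows of `left` with no match
    val cross     = left.crossJoin(right)                      // Cartesian product (Spark 2.1+)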
59
votes
5 answers

Spark parquet partitioning: Large number of files

I am trying to leverage Spark partitioning with something like data.write.partitionBy("key").parquet("/location"). The issue is that each partition creates a huge number of parquet files, which results in slow reads when I am trying to read from…
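
A common mitigation, sketched under the assumption that "key" is the partition column from the question: repartition on that column first, so the in-memory layout matches the on-disk layout and each partition directory receives far fewer files.

    import spark.implicits._  // for the $"..." column syntax

    data
      .repartition($"key")    // co-locate each key's rows in one shuffle partition
      .write
      .partitionBy("key")
      .parquet("/location")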
40
votes
5 answers

How to create SparkSession from existing SparkContext

I have a Spark application which uses the new Spark 2.0 API with SparkSession. I am building this application on top of another application which uses SparkContext. I would like to pass the SparkContext to my application and initialize a SparkSession…
Stefan Repcek
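
A sketch of the usual answer: SparkSession.builder reuses a SparkContext that is already running, so the host application's context carries over. Here sc stands for the existing context.

    import org.apache.spark.sql.SparkSession

    // `sc` is the SparkContext owned by the host application.
    val spark = SparkSession.builder()
      .config(sc.getConf)  // carry the existing configuration into the session
      .getOrCreate()       // binds to the running context rather than starting a new one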
32
votes
3 answers

Spark 2.0 Dataset vs DataFrame

Starting out with Spark 2.0.1, I have some questions. I read a lot of documentation but so far could not find sufficient answers: What is the difference between df.select("foo") and df.select($"foo")? Do I understand correctly…
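
For the select part of the question, a small sketch (assuming import spark.implicits._ is in scope):

    import spark.implicits._

    df.select("foo")              // plain column name: no expressions possible
    df.select($"foo")             // Column expression: composable, e.g. $"foo" + 1, $"foo".alias("f")
    df.select($"foo".as[String])  // TypedColumn: yields a Dataset[String] rather than a DataFrame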
18
votes
2 answers

Spark off-heap memory config and Tungsten

I thought that with the integration of project Tungsten, Spark would automatically use off-heap memory. What are spark.memory.offHeap.size and spark.memory.offHeap.enabled for? Do I need to manually specify the amount of off-heap memory for…
Georg Heiler
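
A sketch of the two settings (note the camel case in the real keys, spark.memory.offHeap.*): off-heap use is opt-in, and a size must be set explicitly whenever the flag is enabled.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .config("spark.memory.offHeap.enabled", "true")
      .config("spark.memory.offHeap.size", "2g")  // illustrative budget, allocated outside the JVM heap
      .getOrCreate()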
17
votes
6 answers

Timeout exception in Apache Spark during program execution

I am running a Bash script on macOS. This script calls a Spark method written in Scala a large number of times. I am currently trying to call this Spark method 100,000 times using a for loop. The code exits with the following…
Yasir Arfat
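
The excerpt cuts off before the stack trace, but this class of failure is commonly a heartbeat or RPC timeout; a hedged sketch of the settings usually raised (values illustrative):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .config("spark.network.timeout", "800s")            // umbrella timeout for RPC and shuffle traffic
      .config("spark.executor.heartbeatInterval", "60s")  // must stay below spark.network.timeout
      .getOrCreate()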
17
votes
3 answers

Dynamically bind variable/parameter in Spark SQL?

How can I bind a variable in Apache Spark SQL? For example: val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) sqlContext.sql("SELECT * FROM src WHERE col1 = ${VAL1}").collect().foreach(println)
user3769729
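
Spark SQL in this era has no server-side bind variables, so the usual workaround is driver-side substitution; a sketch with a hypothetical val1:

    val val1 = 42  // hypothetical value to bind
    // Scala string interpolation splices the value in before Spark parses the SQL.
    spark.sql(s"SELECT * FROM src WHERE col1 = $val1").collect().foreach(println)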
16
votes
0 answers

Spark executors crash due to netty memory leak

When running a Spark Streaming app that consumes data from a Kafka topic with 100 partitions, with 10 executors and 5 cores and 20 GB RAM per executor, the executors crash with the following log: ERROR ResourceLeakDetector: LEAK:…
Elad Eldor
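
The question has no accepted answer; one hedged diagnostic step (not a fix) is to raise netty's leak-detection level on the executors so the LEAK report includes allocation stack traces:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .config("spark.executor.extraJavaOptions",
              "-Dio.netty.leakDetection.level=advanced")  // report where leaked buffers were allocated
      .getOrCreate()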
15
votes
5 answers

Spark fails to start in local mode when disconnected [possible bug in Spark's handling of IPv6?]

The problem is the same as described in "Error when starting spark-shell local on Mac" ... but I have failed to find a solution. I also used to get the malformed URI error, but now I get "expected hostname". So when I am not connected to the internet,…
Aliostad
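
A sketch of the usual workaround: pin Spark to the loopback interface so no hostname resolution is needed (spark.driver.bindAddress is Spark 2.1+; exporting SPARK_LOCAL_IP=127.0.0.1 is the older equivalent).

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .master("local[*]")
      .config("spark.driver.bindAddress", "127.0.0.1")  // bind to loopback, no DNS lookup
      .getOrCreate()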
13
votes
1 answer

Why does using cache on streaming Datasets fail with "AnalysisException: Queries with streaming sources must be executed with writeStream.start()"?

SparkSession .builder .master("local[*]") .config("spark.sql.warehouse.dir", "C:/tmp/spark") .config("spark.sql.streaming.checkpointLocation", "C:/tmp/spark/spark-checkpoint") .appName("my-test") .getOrCreate .readStream …
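
A sketch of the constraint behind the exception: a streaming Dataset cannot be executed eagerly (and cache forces eager execution); the only way to run it is through writeStream ... start(). The rate source below is a stand-in (Spark 2.2+).

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").appName("my-test").getOrCreate()
    val stream = spark.readStream.format("rate").load()  // stand-in streaming source

    val query = stream.writeStream
      .format("console")
      .start()             // the only legal way to execute a streaming Dataset
    query.awaitTermination()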
12
votes
1 answer

Spark2 can't write dataframe to parquet hive table: `HiveFileFormat`. It doesn't match the specified format `ParquetFileFormat`

I'm trying to save a DataFrame to a Hive table. In Spark 1.6 it works, but after migrating to 2.2.0 it doesn't work anymore. Here's the code: blocs .toDF() .repartition($"col1", $"col2", $"col3", $"col4") .write …
youssef grati
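
One workaround reported for this mismatch, sketched with the question's placeholder names: route the write through the Hive serde path explicitly instead of Spark's native parquet writer.

    import spark.implicits._  // for the $"..." column syntax

    blocs.toDF()
      .repartition($"col1", $"col2", $"col3", $"col4")
      .write
      .format("hive")           // Spark 2.2+: write via HiveFileFormat
      .mode("append")
      .saveAsTable("db.table")  // hypothetical table name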
12
votes
2 answers

Apache Spark vs Apache Spark 2

What improvements does Apache Spark 2 bring compared to Apache Spark, from an architecture perspective, from an application point of view, or more?
YoungHobbit
11
votes
2 answers

How to convert RDD of dense vector into DataFrame in pyspark?

I have a DenseVector RDD like this >>> frequencyDenseVectors.collect() [DenseVector([1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0]), DenseVector([1.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]), DenseVector([1.0,…
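
The question is PySpark, but the shape of the usual fix is language-independent: wrap each vector in a one-field row and name the column. A Scala sketch of that pattern with stand-in data:

    import org.apache.spark.ml.linalg.{Vector, Vectors}

    // Stand-in for frequencyDenseVectors.
    val vectors: Seq[Vector] = Seq(Vectors.dense(1.0, 0.0, 1.0), Vectors.dense(1.0, 1.0, 1.0))
    val rdd = spark.sparkContext.parallelize(vectors)

    // Wrap each vector in a 1-tuple so it becomes a single column named "features".
    val df = spark.createDataFrame(rdd.map(Tuple1.apply)).toDF("features")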
10
votes
1 answer

Pass system property to spark-submit and read file from classpath or custom path

I have recently found a way to use logback instead of log4j in Apache Spark (both for local use and spark-submit). However, there is one last piece missing. The issue is that Spark tries very hard not to see the logback.xml settings in its classpath. I have…
Atais
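
For the system-property half of the question, a hedged sketch: logback honours the standard logback.configurationFile property, which can be set programmatically before logging initializes, or passed on spark-submit through --driver-java-options and spark.executor.extraJavaOptions (shipping the file itself with --files).

    // Sketch: point logback at a file outside the classpath before any logger
    // is created; the path is a placeholder.
    System.setProperty("logback.configurationFile", "/path/to/logback.xml")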