Questions tagged [apache-spark-2.0]

Use for questions specific to Apache Spark 2.0. For general questions related to Apache Spark, use the tag [apache-spark].

464 questions
64
votes
4 answers

Reading csv files with quoted fields containing embedded commas

I am reading a csv file in PySpark as follows: df_raw = spark.read.option("header", "true").csv(csv_path). However, the data file has quoted fields with embedded commas in them, which should not be treated as delimiters. How can I handle this in PySpark?…
femibyte
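
A hedged sketch of the standard fix (in Scala; the option names are identical in PySpark): configure the reader's quote and escape options so commas inside quoted fields are kept as data. csv_path is the asker's placeholder.

    // Sketch, assuming fields are wrapped in standard double quotes.
    val df_raw = spark.read
      .option("header", "true")
      .option("quote", "\"")   // the field-quoting character (also the default)
      .option("escape", "\"")  // treat "" inside a quoted field as a literal quote
      .csv(csv_path)           // csv_path: placeholder from the question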
62
votes
5 answers

What are the various join types in Spark?

I looked at the docs, and they say the following join types are supported: Type of join to perform. Default inner. Must be one of: inner, cross, outer, full, full_outer, left, left_outer, right, right_outer, left_semi, left_anti. I looked at…
pathikrit
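
A sketch of how those type strings are passed to Dataset.join, with hypothetical DataFrames left and right sharing an "id" column:

    val inner     = left.join(right, Seq("id"))                // default join type: inner
    val leftOuter = left.join(right, Seq("id"), "left_outer")  // keep every row of `left`
    val semi      = left.join(right, Seq("id"), "left_semi")   // rows of `left` with a match; no columns from `right`
    val anti      = left.join(right, Seq("id"), "left_anti")   // rows of `left` with no match
    val cross     = left.crossJoin(right)                      // Cartesian product (Spark 2.1+)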
59
votes
5 answers

Spark parquet partitioning: Large number of files

I am trying to leverage Spark partitioning with something like data.write.partitionBy("key").parquet("/location"). The issue is that each partition creates a huge number of parquet files, which results in slow reads when I am trying to read from…
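
A common mitigation, sketched under the assumption that "key" is the partition column from the question: repartition on that column first, so the in-memory layout matches the on-disk layout and each partition directory receives far fewer files.

    import spark.implicits._  // for the $"..." column syntax

    data
      .repartition($"key")    // co-locate each key's rows in one shuffle partition
      .write
      .partitionBy("key")
      .parquet("/location")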
40
votes
5 answers

How to create SparkSession from existing SparkContext

I have a Spark application which uses the new Spark 2.0 API with SparkSession. I am building this application on top of another application which uses SparkContext. I would like to pass the SparkContext to my application and initialize a SparkSession…
Stefan Repcek
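
A sketch of the usual answer: SparkSession.builder reuses a SparkContext that is already running, so the host application's context carries over. Here sc stands for the existing context.

    import org.apache.spark.sql.SparkSession

    // `sc` is the SparkContext owned by the host application.
    val spark = SparkSession.builder()
      .config(sc.getConf)  // carry the existing configuration into the session
      .getOrCreate()       // binds to the running context rather than starting a new one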
32
votes
3 answers

Spark 2.0 Dataset vs DataFrame

Starting out with Spark 2.0.1, I have some questions. I read a lot of documentation but so far could not find sufficient answers: What is the difference between df.select("foo") and df.select($"foo")? Do I understand correctly…
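
For the select part of the question, a small sketch (assuming import spark.implicits._ is in scope):

    import spark.implicits._

    df.select("foo")              // plain column name: no expressions possible
    df.select($"foo")             // Column expression: composable, e.g. $"foo" + 1, $"foo".alias("f")
    df.select($"foo".as[String])  // TypedColumn: yields a Dataset[String] rather than a DataFrame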
18
votes
2 answers

Spark off-heap memory config and Tungsten

I thought that with the integration of project Tungsten, Spark would automatically use off-heap memory. What are spark.memory.offHeap.size and spark.memory.offHeap.enabled for? Do I need to manually specify the amount of off-heap memory for…
Georg Heiler
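
A sketch of the two settings (note the camel case in the real keys, spark.memory.offHeap.*): off-heap use is opt-in, and a size must be set explicitly whenever the flag is enabled.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .config("spark.memory.offHeap.enabled", "true")
      .config("spark.memory.offHeap.size", "2g")  // illustrative budget, allocated outside the JVM heap
      .getOrCreate()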
17
votes
6 answers

Timeout exception in Apache Spark during program execution

I am running a Bash script on macOS. This script calls a Spark method written in Scala a large number of times. I am currently trying to call this Spark method 100,000 times using a for loop. The code exits with the following…
Yasir Arfat
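
The excerpt cuts off before the stack trace, but this class of failure is commonly a heartbeat or RPC timeout; a hedged sketch of the settings usually raised (values illustrative):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .config("spark.network.timeout", "800s")            // umbrella timeout for RPC and shuffle traffic
      .config("spark.executor.heartbeatInterval", "60s")  // must stay below spark.network.timeout
      .getOrCreate()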
17
votes
3 answers

Dynamically bind variable/parameter in Spark SQL?

How can I bind a variable in Apache Spark SQL? For example: val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) sqlContext.sql("SELECT * FROM src WHERE col1 = ${VAL1}").collect().foreach(println)
user3769729
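
Spark SQL in this era has no server-side bind variables, so the usual workaround is driver-side substitution; a sketch with a hypothetical val1:

    val val1 = 42  // hypothetical value to bind
    // Scala string interpolation splices the value in before Spark parses the SQL.
    spark.sql(s"SELECT * FROM src WHERE col1 = $val1").collect().foreach(println)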
16
votes
0 answers

Spark executors crash due to netty memory leak

When running a Spark Streaming app that consumes data from a Kafka topic with 100 partitions, with 10 executors and 5 cores and 20 GB RAM per executor, the executors crash with the following log: ERROR ResourceLeakDetector: LEAK:…
Elad Eldor
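
The question has no accepted answer; one hedged diagnostic step (not a fix) is to raise netty's leak-detection level on the executors so the LEAK report includes allocation stack traces:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .config("spark.executor.extraJavaOptions",
              "-Dio.netty.leakDetection.level=advanced")  // report where leaked buffers were allocated
      .getOrCreate()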
15
votes
5 answers

Spark fails to start in local mode when disconnected [possible bug in Spark's handling of IPv6?]

The problem is the same as described in "Error when starting spark-shell local on Mac" ... but I have failed to find a solution. I also used to get the malformed URI error, but now I get "expected hostname". So when I am not connected to the internet,…
Aliostad
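
A sketch of the usual workaround: pin Spark to the loopback interface so no hostname resolution is needed (spark.driver.bindAddress is Spark 2.1+; exporting SPARK_LOCAL_IP=127.0.0.1 is the older equivalent).

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .master("local[*]")
      .config("spark.driver.bindAddress", "127.0.0.1")  // bind to loopback, no DNS lookup
      .getOrCreate()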
13
votes
1 answer

Why does using cache on streaming Datasets fail with "AnalysisException: Queries with streaming sources must be executed with writeStream.start()"?

SparkSession .builder .master("local[*]") .config("spark.sql.warehouse.dir", "C:/tmp/spark") .config("spark.sql.streaming.checkpointLocation", "C:/tmp/spark/spark-checkpoint") .appName("my-test") .getOrCreate .readStream …
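
A sketch of the constraint behind the exception: a streaming Dataset cannot be executed eagerly (and cache forces eager execution); the only way to run it is through writeStream ... start(). The rate source below is a stand-in (Spark 2.2+).

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").appName("my-test").getOrCreate()
    val stream = spark.readStream.format("rate").load()  // stand-in streaming source

    val query = stream.writeStream
      .format("console")
      .start()             // the only legal way to execute a streaming Dataset
    query.awaitTermination()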
12
votes
1 answer

Spark2 can't write dataframe to parquet hive table: `HiveFileFormat`. It doesn't match the specified format `ParquetFileFormat`

I'm trying to save a DataFrame to a Hive table. In Spark 1.6 it works, but after migrating to 2.2.0 it doesn't work anymore. Here's the code: blocs .toDF() .repartition($"col1", $"col2", $"col3", $"col4") .write …
youssef grati
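
One workaround reported for this mismatch, sketched with the question's placeholder names: route the write through the Hive serde path explicitly instead of Spark's native parquet writer.

    import spark.implicits._  // for the $"..." column syntax

    blocs.toDF()
      .repartition($"col1", $"col2", $"col3", $"col4")
      .write
      .format("hive")           // Spark 2.2+: write via HiveFileFormat
      .mode("append")
      .saveAsTable("db.table")  // hypothetical table name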
12
votes
2 answers

Apache Spark vs Apache Spark 2

What improvements does Apache Spark 2 bring compared to Apache Spark, from an architecture perspective, from an application point of view, or more?
YoungHobbit
11
votes
2 answers

How to convert RDD of dense vector into DataFrame in pyspark?

I have a DenseVector RDD like this >>> frequencyDenseVectors.collect() [DenseVector([1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0]), DenseVector([1.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]), DenseVector([1.0,…
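
The question is PySpark, but the shape of the usual fix is language-independent: wrap each vector in a one-field row and name the column. A Scala sketch of that pattern with stand-in data:

    import org.apache.spark.ml.linalg.{Vector, Vectors}

    // Stand-in for frequencyDenseVectors.
    val vectors: Seq[Vector] = Seq(Vectors.dense(1.0, 0.0, 1.0), Vectors.dense(1.0, 1.0, 1.0))
    val rdd = spark.sparkContext.parallelize(vectors)

    // Wrap each vector in a 1-tuple so it becomes a single column named "features".
    val df = spark.createDataFrame(rdd.map(Tuple1.apply)).toDF("features")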
10
votes
1 answer

Pass system property to spark-submit and read file from classpath or custom path

I have recently found a way to use logback instead of log4j in Apache Spark (both for local use and spark-submit). However, there is one last piece missing. The issue is that Spark tries very hard not to see the logback.xml settings in its classpath. I have…
Atais
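
For the system-property half of the question, a hedged sketch: logback honours the standard logback.configurationFile property, which can be set programmatically before logging initializes, or passed on spark-submit through --driver-java-options and spark.executor.extraJavaOptions (shipping the file itself with --files).

    // Sketch: point logback at a file outside the classpath before any logger
    // is created; the path is a placeholder.
    System.setProperty("logback.configurationFile", "/path/to/logback.xml")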