Questions tagged [apache-spark-1.6]

Use for questions specific to Apache Spark 1.6. For general questions related to Apache Spark use the tag [apache-spark].

111 questions
36 votes, 3 answers

PySpark serialization EOFError

I am reading in a CSV as a Spark DataFrame and performing machine learning operations upon it. I keep getting a Python serialization EOFError - any idea why? I thought it might be a memory issue - i.e. file exceeding available RAM - but drastically…
Tom Wallace • 383 • 1 • 3 • 6
25 votes, 2 answers

Reading CSV into a Spark Dataframe with timestamp and date types

It's CDH with Spark 1.6. I am trying to import this hypothetical CSV into an Apache Spark DataFrame: $ hadoop fs -cat test.csv a,b,c,2016-09-09,a,2016-11-11 09:09:09.0,a a,b,c,2016-09-10,a,2016-11-11 09:09:10.0,a I use the databricks-csv jar. val…
Mihir Shinde • 657 • 2 • 8 • 13
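A minimal Spark 1.6 / spark-csv sketch for this kind of file; the column names and the explicit schema are assumptions based on the sample rows, since the real header is not shown:

    // Requires the com.databricks:spark-csv package on the classpath.
    // Column names are hypothetical; the date/timestamp columns mirror the sample rows.
    import org.apache.spark.sql.types._

    val schema = StructType(Seq(
      StructField("c1", StringType),
      StructField("c2", StringType),
      StructField("c3", StringType),
      StructField("eventDate", DateType),
      StructField("c5", StringType),
      StructField("eventTs", TimestampType),
      StructField("c7", StringType)))

    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .schema(schema)          // explicit types instead of inferSchema
      .load("test.csv")

With an explicit schema, values like 2016-09-09 and 2016-11-11 09:09:09.0 should parse straight into the DateType and TimestampType columns.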
19 votes, 1 answer

Where is the reference for options for writing or reading per format?

I use Spark 1.6.1. We are trying to write an ORC file to HDFS using HiveContext and DataFrameWriter. While we can use df.write().orc(), we would rather do something like df.write().options(Map("format" -> "orc", "path" -> "/some_path")). This…
Satyam • 645 • 2 • 7 • 20
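In the 1.6 DataFrameWriter API the format is not an option; a sketch of the two equivalent spellings, with /some_path taken from the question:

    // The format is chosen with .format(...); per-source settings go through
    // .option(...)/.options(...), and the path can be an option or passed to save().
    df.write
      .format("orc")
      .option("path", "/some_path")
      .save()

    // shorthand
    df.write.format("orc").save("/some_path")

In 1.6 the ORC source still needs a HiveContext-backed DataFrame, as in the question.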
15 votes, 2 answers

How to use collect_set and collect_list functions in windowed aggregation in Spark 1.6?

In Spark 1.6.0 / Scala, is there a way to get collect_list("colC") or collect_set("colC").over(Window.partitionBy("colA").orderBy("colB"))?
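A sketch of the windowed form, assuming a DataFrame df with columns colA, colB, colC; in 1.6 these aggregates work as window functions only with Hive support:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{collect_list, collect_set}

    // Spark 1.6 requires a HiveContext for collect_list/collect_set over a window.
    val w = Window.partitionBy("colA").orderBy("colB")

    val result = df
      .withColumn("colCList", collect_list("colC").over(w))
      .withColumn("colCSet", collect_set("colC").over(w))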
14 votes, 1 answer

What to do with "WARN TaskSetManager: Stage contains a task of very large size"?

I use Spark 1.6.1. My Spark application reads more than 10000 Parquet files stored in S3. val df = sqlContext.read.option("mergeSchema", "true").parquet(myPaths: _*) myPaths is an Array[String] that contains the paths of the 10000 Parquet files.…
reapasisow • 275 • 1 • 2 • 9
11 votes, 2 answers

Spark CrossValidatorModel access other models than the bestModel?

I am using Spark 1.6.1: Currently I am using a CrossValidator to train my ML Pipeline with various parameters. After the training process I can use the bestModel property of the CrossValidatorModel to get the Model that performed best during the…
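Spark 1.6 does not retain the non-best fitted models, but the averaged metric for every parameter combination is exposed; a sketch assuming cv is the CrossValidator and cvModel its fitted result:

    // avgMetrics lines up with getEstimatorParamMaps, one entry per combination.
    val paramMaps = cv.getEstimatorParamMaps
    val metrics   = cvModel.avgMetrics

    paramMaps.zip(metrics)
      .sortBy { case (_, metric) => -metric }
      .foreach { case (params, metric) => println(s"$metric -> $params") }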
10 votes, 3 answers

Get first non-null values in group by (Spark 1.6)

How can I get the first non-null values from a group by? I tried using first with coalesce, F.first(F.coalesce("code")), but I don't get the desired behavior (I seem to get the first row). from pyspark import SparkContext from pyspark.sql import…
Kamil Sindi • 21,782 • 19 • 96 • 120
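Two hedged workarounds; the id/code column names mirror the question, and the ignorenulls flag on first() may only exist in releases newer than 1.6:

    from pyspark.sql import functions as F

    # Option 1: first() with ignorenulls (available in later Spark versions;
    # may not be exposed in the 1.6 Python API).
    agg1 = df.groupBy("id").agg(F.first("code", ignorenulls=True).alias("code"))

    # Option 2 (1.6-friendly): drop the null rows before aggregating.
    # Note: groups whose values are all null disappear from the result.
    agg2 = (df.where(F.col("code").isNotNull())
              .groupBy("id")
              .agg(F.first("code").alias("code")))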
8 votes, 2 answers

Why does a Spark application on YARN fail with FetchFailedException due to Connection refused?

I am using Spark 1.6.3 and YARN 2.7.1.2.3, which comes with HDP-2.3.0.0-2557. Because the Spark version in the HDP release I use is too old, I prefer to run another Spark in YARN mode remotely. Here is how I run spark-shell: ./spark-shell…
Ahmet DAL • 4,445 • 9 • 47 • 71
7 votes, 2 answers

How to enable or disable Hive support in spark-shell through Spark property (Spark 1.6)?

Is there any configuration property we can set to explicitly enable or disable Hive support through spark-shell in Spark 1.6? I tried to get all the sqlContext configuration properties with sqlContext.getAllConfs.foreach(println), but I am not…
Krishna Reddy • 1,069 • 5 • 12 • 18
7 votes, 1 answer

Dynamic Allocation for Spark Streaming

I have a Spark Streaming job running on our cluster alongside other jobs (Spark Core jobs). I want to use Dynamic Resource Allocation for these jobs, including Spark Streaming. According to the JIRA issue below, Dynamic Allocation is not supported for Spark…
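For the ordinary batch jobs, dynamic allocation is driven by configuration; a sketch of the usual properties (the executor bounds are placeholder values), keeping in mind that streaming-aware dynamic allocation is not part of 1.6:

    import org.apache.spark.SparkConf

    // The external shuffle service must be running on the workers/NodeManagers.
    val conf = new SparkConf()
      .set("spark.dynamicAllocation.enabled", "true")
      .set("spark.shuffle.service.enabled", "true")
      .set("spark.dynamicAllocation.minExecutors", "2")   // placeholder
      .set("spark.dynamicAllocation.maxExecutors", "20")  // placeholder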
7 votes, 3 answers

How to replace NULL with 0 in a left outer join in a Spark DataFrame (v1.6)

I am working with Spark v1.6. I have the following two DataFrames and I want to convert the nulls to 0 in my left outer join result set. Any suggestions? DataFrames val x: Array[Int] = Array(1,2,3) val df_sample_x = sc.parallelize(x).toDF("x") val y:…
Prasan • 111 • 1 • 2 • 4
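A sketch that continues the question's example; df_sample_y and its column name are assumptions following the truncated snippet, and na.fill(0) replaces the nulls produced by the outer join:

    // df_sample_y / "y" are hypothetical, mirroring the truncated excerpt.
    val joined = df_sample_x.join(df_sample_y, df_sample_x("x") === df_sample_y("y"), "left_outer")

    // Replace nulls in numeric columns with 0.
    val result = joined.na.fill(0)
    result.show()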
7 votes, 2 answers

How to dynamically choose spark.sql.shuffle.partitions

I am currently processing data using Spark; for each partition I open a connection to MySQL and insert the rows into the database in batches of 1000. As mentioned in the Spark documentation, the default value of spark.sql.shuffle.partitions is 200, but I want…
Naresh • 5,073 • 12 • 67 • 124
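A sketch of choosing the value at runtime; the sizing heuristic (rows per partition) is an assumption, not something prescribed by Spark:

    // Derive a partition count from the data size, then set it before the shuffle runs.
    val totalRows = df.count()
    val rowsPerPartition = 1000L                       // matches the question's batch size
    val targetPartitions = math.max(1L, totalRows / rowsPerPartition).toInt

    sqlContext.setConf("spark.sql.shuffle.partitions", targetPartitions.toString)

    // Or repartition an existing DataFrame explicitly:
    val repartitioned = df.repartition(targetPartitions)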
6 votes, 3 answers

How to register S3 Parquet files in a Hive Metastore using Spark on EMR

I am using Amazon Elastic Map Reduce 4.7.1, Hadoop 2.7.2, Hive 1.0.0, and Spark 1.6.1. Use case: I have a Spark cluster used for processing data. That data is stored in S3 as Parquet files. I want tools to be able to query the data using names…
Sam King • 2,068 • 18 • 29
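A sketch of registering the existing S3 Parquet data through a HiveContext; the table name, columns, and bucket path are hypothetical:

    // The external table only records metadata; the Parquet files stay in S3.
    sqlContext.sql("""
      CREATE EXTERNAL TABLE IF NOT EXISTS events (user_id STRING, event_time TIMESTAMP)
      STORED AS PARQUET
      LOCATION 's3://my-bucket/events/'
    """)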
6 votes, 2 answers

Running Spark job not shown in the UI

I have submitted my Spark job as mentioned here: bin/spark-submit --class DataSet BasicSparkJob-assembly-1.0.jar without mentioning the --master parameter or the spark.master property. Instead, the job gets submitted to my 3-node Spark cluster. But I…
Naresh • 5,073 • 12 • 67 • 124
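If the goal is to see the job in the standalone cluster's UI, the usual fix is to name the master explicitly; a sketch with a placeholder host (7077 is the standalone default port):

    bin/spark-submit \
      --class DataSet \
      --master spark://<master-host>:7077 \
      BasicSparkJob-assembly-1.0.jar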
5 votes, 2 answers

PySpark: How to use a row value from one column to access another column that has the same name as the row value

I have a PySpark df:
+---+---+---+---+---+---+---+---+
| id| a1| b1| c1| d1| e1| f1|ref|
+---+---+---+---+---+---+---+---+
|  0|  1| 23|  4|  8|  9|  5| b1|
|  1|  2| 43|  8| 10| 20| 43| e1|
|  2|  3| 15|  0|  1| 23|  7| b1|
|  3|  4|  2|  6| 11| …
Mia21 • 119 • 2 • 10
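A sketch using one when(...) per candidate column wrapped in coalesce, so the value is taken from whichever column name matches ref; the output column name is an assumption:

    from pyspark.sql import functions as F

    candidate_cols = ["a1", "b1", "c1", "d1", "e1", "f1"]

    # For each row, exactly one when(...) is non-null: the column named by "ref".
    df = df.withColumn(
        "ref_value",
        F.coalesce(*[F.when(F.col("ref") == c, F.col(c)) for c in candidate_cols])
    )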