Questions tagged [apache-spark-1.6]
111 questions
Use for questions specific to Apache Spark 1.6. For general questions related to Apache Spark, use the tag [apache-spark].
36
votes
3 answers
PySpark serialization EOFError
I am reading in a CSV as a Spark DataFrame and performing machine learning operations upon it. I keep getting a Python serialization EOFError - any idea why? I thought it might be a memory issue - i.e. file exceeding available RAM - but drastically…

Tom Wallace
- 383
- 1
- 3
- 6
25
votes
2 answers
Reading CSV into a Spark Dataframe with timestamp and date types
It's CDH with Spark 1.6.
I am trying to import this hypothetical CSV into an Apache Spark DataFrame:
$ hadoop fs -cat test.csv
a,b,c,2016-09-09,a,2016-11-11 09:09:09.0,a
a,b,c,2016-09-10,a,2016-11-11 09:09:10.0,a
I use the databricks-csv jar.
val…
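
A hedged sketch of the usual approach with the spark-csv package in Spark 1.6: pass an explicit schema so the date and timestamp columns are not left as strings. The column names and path below are placeholders, and this assumes a spark-csv version that supports casting to DateType/TimestampType.

    import org.apache.spark.sql.types._

    // Placeholder schema matching the shape of the sample rows above.
    val schema = StructType(Seq(
      StructField("c1", StringType, true),
      StructField("c2", StringType, true),
      StructField("c3", StringType, true),
      StructField("d",  DateType, true),
      StructField("c4", StringType, true),
      StructField("ts", TimestampType, true),
      StructField("c5", StringType, true)
    ))

    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "false")
      .schema(schema)
      .load("test.csv")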

Mihir Shinde
- 657
- 2
- 8
- 13
19
votes
1 answer
Where is the reference for options for writing or reading per format?
I use Spark 1.6.1.
We are trying to write an ORC file to HDFS using HiveContext and DataFrameWriter. While we can use
df.write().orc()
we would rather do something like
df.write().options(Map("format" -> "orc", "path" -> "/some_path"))
This…
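
A minimal sketch of the Spark 1.6 DataFrameWriter calls involved, assuming df comes from a HiveContext (the ORC source needs one in 1.6); the path is a placeholder. Note that the format goes through format("orc") rather than as an entry in the options map:

    // Scala; df is the DataFrame from the question.
    df.write
      .format("orc")
      .option("path", "/some_path")
      .save()

    // or, closer to the Map-based form in the question:
    df.write
      .format("orc")
      .options(Map("path" -> "/some_path"))
      .save()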

Satyam
- 645
- 2
- 7
- 20
15
votes
2 answers
How to use collect_set and collect_list functions in windowed aggregation in Spark 1.6?
In Spark 1.6.0 / Scala, is there a way to get collect_list("colC") or collect_set("colC").over(Window.partitionBy("colA").orderBy("colB"))?
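
A hedged workaround sketch for 1.6 (in that release collect_list/collect_set are Hive UDAFs, so a HiveContext-backed sqlContext is generally needed): aggregate per key and join back. This yields the full per-group list rather than the running list an ordered window would give.

    import org.apache.spark.sql.functions.{col, collect_list}

    // Toy data using the column names from the question.
    val df = sqlContext.createDataFrame(Seq(
      ("a", 1, "x"),
      ("a", 2, "y"),
      ("b", 1, "z")
    )).toDF("colA", "colB", "colC")

    val lists = df.groupBy("colA").agg(collect_list(col("colC")).as("colC_list"))
    val result = df.join(lists, "colA")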

Dzmitry Haikov
- 199
- 1
- 2
- 6
14
votes
1 answer
What to do with "WARN TaskSetManager: Stage contains a task of very large size"?
I use Spark 1.6.1.
My Spark application reads more than 10,000 Parquet files stored in S3.
val df = sqlContext.read.option("mergeSchema", "true").parquet(myPaths: _*)
myPaths is an Array[String] that contains the paths of the 10,000 Parquet files…

reapasisow
- 275
- 1
- 2
- 9
11
votes
2 answers
Spark CrossValidatorModel access other models than the bestModel?
I am using Spark 1.6.1:
Currently I am using a CrossValidator to train my ML Pipeline with various parameters. After the training process I can use the bestModel property of the CrossValidatorModel to get the Model that performed best during the…

MeiSign
- 1,487
- 1
- 15
- 39
10
votes
3 answers
Get first non-null values in group by (Spark 1.6)
How can I get the first non-null values from a group by? I tried using first with coalesce F.first(F.coalesce("code")) but I don't get the desired behavior (I seem to get the first row).
from pyspark import SparkContext
from pyspark.sql import…
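
The question is PySpark, but one 1.6-safe workaround is compact enough to sketch in Scala (the same calls exist in pyspark.sql.functions): drop the null rows before grouping, so first() can only pick a non-null value. The id/code column names are assumptions based on the excerpt, and without an explicit ordering which value comes "first" is not guaranteed.

    import org.apache.spark.sql.functions.first

    // Toy data with assumed column names.
    val df = sqlContext.createDataFrame(Seq(
      (1, null.asInstanceOf[String]),
      (1, "a"),
      (2, "b"),
      (2, null.asInstanceOf[String])
    )).toDF("id", "code")

    val firstNonNull = df
      .filter(df("code").isNotNull)           // remove nulls up front
      .groupBy("id")
      .agg(first("code").as("first_code"))    // first() now only sees non-null values

    firstNonNull.show()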

Kamil Sindi
- 21,782
- 19
- 96
- 120
8
votes
2 answers
Why does a Spark application on YARN fail with FetchFailedException due to Connection refused?
I am using Spark 1.6.3 and YARN 2.7.1.2.3, which comes with HDP-2.3.0.0-2557. Because the Spark version bundled with that HDP release is too old, I prefer to use another Spark, run in YARN mode remotely.
Here is how I run the Spark shell:
./spark-shell…

Ahmet DAL
- 4,445
- 9
- 47
- 71
7
votes
2 answers
How to enable or disable Hive support in spark-shell through Spark property (Spark 1.6)?
Is there any configuration property we can set to explicitly enable or disable Hive support through spark-shell in Spark 1.6? I tried to get all the sqlContext configuration properties with:
sqlContext.getAllConfs.foreach(println)
But I am not…
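
One thing that can be checked directly inside spark-shell is which context was actually created; a small sketch (in 1.6 the shell generally builds a HiveContext when the Hive classes are on the classpath, and falls back to a plain SQLContext otherwise):

    // Run inside spark-shell 1.6; sqlContext is pre-created by the shell.
    println(sqlContext.getClass.getName)
    // "org.apache.spark.sql.hive.HiveContext" -> Hive support is enabled
    // "org.apache.spark.sql.SQLContext"       -> plain SQLContext, no Hive support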

Krishna Reddy
- 1,069
- 5
- 12
- 18
7
votes
1 answer
Dynamic Allocation for Spark Streaming
I have a Spark Streaming job running on our cluster alongside other jobs (Spark Core jobs). I want to use Dynamic Resource Allocation for these jobs, including Spark Streaming. According to the JIRA issue below, Dynamic Allocation is not supported for Spark…
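
For context, a sketch of how the standard dynamic-allocation properties are set on a SparkConf in 1.6; whether they behave sensibly for a long-running streaming job is exactly what the question and the JIRA discussion are about. The min/max values are placeholders.

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("streaming-with-dynamic-allocation")
      .set("spark.dynamicAllocation.enabled", "true")
      .set("spark.shuffle.service.enabled", "true")      // external shuffle service is required
      .set("spark.dynamicAllocation.minExecutors", "2")
      .set("spark.dynamicAllocation.maxExecutors", "20")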

Akhila Lankala
- 193
- 1
- 11
7
votes
3 answers
How to replace NULL with 0 in a left outer join in a Spark DataFrame (v1.6)
I am working with Spark v1.6. I have the following two DataFrames, and I want to convert the nulls to 0 in my left outer join result set. Any suggestions?
DataFrames
val x: Array[Int] = Array(1,2,3)
val df_sample_x = sc.parallelize(x).toDF("x")
val y:…
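
A sketch of the two usual fixes in 1.6: DataFrame.na.fill for all numeric columns, or coalesce with a literal per column. The second DataFrame is cut off in the excerpt, so df_sample_y and the join condition below are assumptions.

    import org.apache.spark.sql.functions.{coalesce, lit}
    import sqlContext.implicits._   // for .toDF on an RDD (auto-imported in spark-shell)

    val x: Array[Int] = Array(1, 2, 3)
    val df_sample_x = sc.parallelize(x).toDF("x")

    // Assumed second DataFrame, since it is truncated above.
    val y: Array[Int] = Array(2, 3, 4)
    val df_sample_y = sc.parallelize(y).toDF("y")

    val joined = df_sample_x.join(df_sample_y, df_sample_x("x") === df_sample_y("y"), "left_outer")

    val filled  = joined.na.fill(0)                                      // all numeric columns
    val filled2 = joined.withColumn("y", coalesce(joined("y"), lit(0)))  // one column at a time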

Prasan
- 111
- 1
- 2
- 4
7
votes
2 answers
How to dynamically choose spark.sql.shuffle.partitions
I am currently processing data using Spark; for each partition I open a connection to MySQL and insert records into the database in batches of 1000. As mentioned in the Spark documentation, the default value of spark.sql.shuffle.partitions is 200, but I want…
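
A sketch of the two knobs usually combined for this in 1.6: setting spark.sql.shuffle.partitions on the SQLContext at runtime, or repartitioning the DataFrame explicitly just before the foreachPartition write. The value 50 and the DataFrame df are placeholders.

    // Change the shuffle partition count at runtime (affects subsequent shuffles):
    sqlContext.setConf("spark.sql.shuffle.partitions", "50")

    // Or pin the partition count of one DataFrame right before writing:
    val repartitioned = df.repartition(50)
    repartitioned.foreachPartition { rows =>
      // open one MySQL connection per partition here (connection code omitted)
      rows.grouped(1000).foreach { batch =>
        // insert `batch` with addBatch/executeBatch
      }
    }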

Naresh
- 5,073
- 12
- 67
- 124
6
votes
3 answers
How to register S3 Parquet files in a Hive Metastore using Spark on EMR
I am using Amazon Elastic MapReduce 4.7.1, Hadoop 2.7.2, Hive 1.0.0, and Spark 1.6.1.
Use case: I have a Spark cluster used for processing data. That data is stored in S3 as Parquet files. I want tools to be able to query the data using names…
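
A sketch of the common pattern: from a HiveContext, register an external table that points at the existing S3 Parquet location, so Hive-aware tools can query it by name. The table name, columns, and bucket path are placeholders.

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)

    hiveContext.sql(
      """CREATE EXTERNAL TABLE IF NOT EXISTS events (
        |  id BIGINT,
        |  name STRING
        |)
        |STORED AS PARQUET
        |LOCATION 's3://my-bucket/path/to/parquet/'
      """.stripMargin)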

Sam King
- 2,068
- 18
- 29
6
votes
2 answers
Running Spark job not shown in the UI
I have submitted my Spark job as mentioned here: bin/spark-submit --class DataSet BasicSparkJob-assembly-1.0.jar, without specifying the --master parameter or the spark.master property. Despite that, the job gets submitted to my 3-node Spark cluster. But I…
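
For reference, a sketch of where a master URL is normally made explicit so there is no ambiguity about where the job runs; the URL is a placeholder.

    import org.apache.spark.{SparkConf, SparkContext}

    // In code (the command-line equivalent is:
    //   bin/spark-submit --master spark://master-host:7077 --class DataSet BasicSparkJob-assembly-1.0.jar)
    val conf = new SparkConf()
      .setAppName("DataSet")
      .setMaster("spark://master-host:7077")
    val sc = new SparkContext(conf)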

Naresh
- 5,073
- 12
- 67
- 124
5
votes
2 answers
PySpark - How to use a row value from one column to access another column which has the same name as the row value
I have a PySpark df:
+---+---+---+---+---+---+---+---+
| id| a1| b1| c1| d1| e1| f1|ref|
+---+---+---+---+---+---+---+---+
| 0| 1| 23| 4| 8| 9| 5| b1|
| 1| 2| 43| 8| 10| 20| 43| e1|
| 2| 3| 15| 0| 1| 23| 7| b1|
| 3| 4| 2| 6| 11| …
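
The question is PySpark, but the usual trick is compact enough to sketch in Scala (when and coalesce exist in pyspark.sql.functions as well): build one conditional per candidate column and coalesce them, so the column named in ref supplies the value. The candidate names follow the excerpt; the output column name and df are assumptions.

    import org.apache.spark.sql.functions.{coalesce, col, when}

    // Columns that "ref" can point at, taken from the excerpt.
    val candidates = Seq("a1", "b1", "c1", "d1", "e1", "f1")

    // when(...) without otherwise(...) is null when the condition is false,
    // so coalesce keeps only the value from the column that "ref" names.
    val refValue = coalesce(candidates.map(c => when(col("ref") === c, col(c))): _*)

    val result = df.withColumn("ref_value", refValue)   // df: the DataFrame shown above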

Mia21
- 119
- 2
- 10