Questions tagged [apache-spark-1.6]

Use for questions specific to Apache Spark 1.6. For general questions related to Apache Spark use the tag [apache-spark].

111 questions
36 votes, 3 answers

PySpark serialization EOFError

I am reading in a CSV as a Spark DataFrame and performing machine learning operations upon it. I keep getting a Python serialization EOFError - any idea why? I thought it might be a memory issue - i.e. file exceeding available RAM - but drastically…
Tom Wallace • 383 • 1 • 3 • 6
25 votes, 2 answers

Reading CSV into a Spark Dataframe with timestamp and date types

It's CDH with Spark 1.6. I am trying to import this hypothetical CSV into an Apache Spark DataFrame: $ hadoop fs -cat test.csv a,b,c,2016-09-09,a,2016-11-11 09:09:09.0,a a,b,c,2016-09-10,a,2016-11-11 09:09:10.0,a I use the databricks-csv jar. val…
Mihir Shinde • 657 • 2 • 8 • 13
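A minimal Spark 1.6 / spark-csv sketch for this kind of file; the column names and the explicit schema are assumptions based on the sample rows, since the real header is not shown:

    // Requires the com.databricks:spark-csv package on the classpath.
    // Column names are hypothetical; the date/timestamp columns mirror the sample rows.
    import org.apache.spark.sql.types._

    val schema = StructType(Seq(
      StructField("c1", StringType),
      StructField("c2", StringType),
      StructField("c3", StringType),
      StructField("eventDate", DateType),
      StructField("c5", StringType),
      StructField("eventTs", TimestampType),
      StructField("c7", StringType)))

    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .schema(schema)          // explicit types instead of inferSchema
      .load("test.csv")

With an explicit schema, values like 2016-09-09 and 2016-11-11 09:09:09.0 should parse straight into the DateType and TimestampType columns.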
19 votes, 1 answer

Where is the reference for options for writing or reading per format?

I use Spark 1.6.1. We are trying to write an ORC file to HDFS using HiveContext and DataFrameWriter. While we can use df.write().orc(), we would rather do something like df.write().options(Map("format" -> "orc", "path" -> "/some_path")). This…
Satyam • 645 • 2 • 7 • 20
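In the 1.6 DataFrameWriter API the format is not an option; a sketch of the two equivalent spellings, with /some_path taken from the question:

    // The format is chosen with .format(...); per-source settings go through
    // .option(...)/.options(...), and the path can be an option or passed to save().
    df.write
      .format("orc")
      .option("path", "/some_path")
      .save()

    // shorthand
    df.write.format("orc").save("/some_path")

In 1.6 the ORC source still needs a HiveContext-backed DataFrame, as in the question.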
15 votes, 2 answers

How to use collect_set and collect_list functions in windowed aggregation in Spark 1.6?

In Spark 1.6.0 / Scala, is there a way to get collect_list("colC") or collect_set("colC").over(Window.partitionBy("colA").orderBy("colB"))?
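A sketch of the windowed form, assuming a DataFrame df with columns colA, colB, colC; in 1.6 these aggregates work as window functions only with Hive support:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{collect_list, collect_set}

    // Spark 1.6 requires a HiveContext for collect_list/collect_set over a window.
    val w = Window.partitionBy("colA").orderBy("colB")

    val result = df
      .withColumn("colCList", collect_list("colC").over(w))
      .withColumn("colCSet", collect_set("colC").over(w))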
14 votes, 1 answer

What to do with "WARN TaskSetManager: Stage contains a task of very large size"?

I use Spark 1.6.1. My Spark application reads more than 10000 Parquet files stored in S3. val df = sqlContext.read.option("mergeSchema", "true").parquet(myPaths: _*) myPaths is an Array[String] that contains the paths of the 10000 Parquet files.…
reapasisow • 275 • 1 • 2 • 9
11 votes, 2 answers

Spark CrossValidatorModel access other models than the bestModel?

I am using Spark 1.6.1: Currently I am using a CrossValidator to train my ML Pipeline with various parameters. After the training process I can use the bestModel property of the CrossValidatorModel to get the Model that performed best during the…
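Spark 1.6 does not retain the non-best fitted models, but the averaged metric for every parameter combination is exposed; a sketch assuming cv is the CrossValidator and cvModel its fitted result:

    // avgMetrics lines up with getEstimatorParamMaps, one entry per combination.
    val paramMaps = cv.getEstimatorParamMaps
    val metrics   = cvModel.avgMetrics

    paramMaps.zip(metrics)
      .sortBy { case (_, metric) => -metric }
      .foreach { case (params, metric) => println(s"$metric -> $params") }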
10 votes, 3 answers

Get first non-null values in group by (Spark 1.6)

How can I get the first non-null values from a group by? I tried using first with coalesce, F.first(F.coalesce("code")), but I don't get the desired behavior (I seem to get the first row). from pyspark import SparkContext from pyspark.sql import…
Kamil Sindi • 21,782 • 19 • 96 • 120
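Two hedged workarounds; the id/code column names mirror the question, and the ignorenulls flag on first() may only exist in releases newer than 1.6:

    from pyspark.sql import functions as F

    # Option 1: first() with ignorenulls (available in later Spark versions;
    # may not be exposed in the 1.6 Python API).
    agg1 = df.groupBy("id").agg(F.first("code", ignorenulls=True).alias("code"))

    # Option 2 (1.6-friendly): drop the null rows before aggregating.
    # Note: groups whose values are all null disappear from the result.
    agg2 = (df.where(F.col("code").isNotNull())
              .groupBy("id")
              .agg(F.first("code").alias("code")))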
8 votes, 2 answers

Why does a Spark application on YARN fail with FetchFailedException due to Connection refused?

I am using Spark 1.6.3 and YARN 2.7.1.2.3, which comes with HDP-2.3.0.0-2557. Because the Spark version in the HDP release I use is too old, I prefer to run another Spark in YARN mode remotely. Here is how I run spark-shell: ./spark-shell…
Ahmet DAL • 4,445 • 9 • 47 • 71
7 votes, 2 answers

How to enable or disable Hive support in spark-shell through Spark property (Spark 1.6)?

Is there any configuration property we can set to explicitly enable or disable Hive support through spark-shell in Spark 1.6? I tried to get all the sqlContext configuration properties with sqlContext.getAllConfs.foreach(println), but I am not…
Krishna Reddy • 1,069 • 5 • 12 • 18
7 votes, 1 answer

Dynamic Allocation for Spark Streaming

I have a Spark Streaming job running on our cluster alongside other jobs (Spark Core jobs). I want to use Dynamic Resource Allocation for these jobs, including Spark Streaming. According to the JIRA issue below, Dynamic Allocation is not supported for Spark…
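For the ordinary batch jobs, dynamic allocation is driven by configuration; a sketch of the usual properties (the executor bounds are placeholder values), keeping in mind that streaming-aware dynamic allocation is not part of 1.6:

    import org.apache.spark.SparkConf

    // The external shuffle service must be running on the workers/NodeManagers.
    val conf = new SparkConf()
      .set("spark.dynamicAllocation.enabled", "true")
      .set("spark.shuffle.service.enabled", "true")
      .set("spark.dynamicAllocation.minExecutors", "2")   // placeholder
      .set("spark.dynamicAllocation.maxExecutors", "20")  // placeholder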
7 votes, 3 answers

How to replace NULL with 0 in a left outer join in a Spark DataFrame (v1.6)

I am working with Spark v1.6. I have the following two DataFrames and I want to convert the nulls to 0 in my left outer join result set. Any suggestions? DataFrames val x: Array[Int] = Array(1,2,3) val df_sample_x = sc.parallelize(x).toDF("x") val y:…
Prasan • 111 • 1 • 2 • 4
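A sketch that continues the question's example; df_sample_y and its column name are assumptions following the truncated snippet, and na.fill(0) replaces the nulls produced by the outer join:

    // df_sample_y / "y" are hypothetical, mirroring the truncated excerpt.
    val joined = df_sample_x.join(df_sample_y, df_sample_x("x") === df_sample_y("y"), "left_outer")

    // Replace nulls in numeric columns with 0.
    val result = joined.na.fill(0)
    result.show()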
7 votes, 2 answers

How to dynamically choose spark.sql.shuffle.partitions

I am currently processing data using Spark; for each partition I open a connection to MySQL and insert the rows into the database in batches of 1000. As mentioned in the Spark documentation, the default value of spark.sql.shuffle.partitions is 200, but I want…
Naresh • 5,073 • 12 • 67 • 124
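A sketch of choosing the value at runtime; the sizing heuristic (rows per partition) is an assumption, not something prescribed by Spark:

    // Derive a partition count from the data size, then set it before the shuffle runs.
    val totalRows = df.count()
    val rowsPerPartition = 1000L                       // matches the question's batch size
    val targetPartitions = math.max(1L, totalRows / rowsPerPartition).toInt

    sqlContext.setConf("spark.sql.shuffle.partitions", targetPartitions.toString)

    // Or repartition an existing DataFrame explicitly:
    val repartitioned = df.repartition(targetPartitions)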
6 votes, 3 answers

How to register S3 Parquet files in a Hive Metastore using Spark on EMR

I am using Amazon Elastic Map Reduce 4.7.1, Hadoop 2.7.2, Hive 1.0.0, and Spark 1.6.1. Use case: I have a Spark cluster used for processing data. That data is stored in S3 as Parquet files. I want tools to be able to query the data using names…
Sam King • 2,068 • 18 • 29
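A sketch of registering the existing S3 Parquet data through a HiveContext; the table name, columns, and bucket path are hypothetical:

    // The external table only records metadata; the Parquet files stay in S3.
    sqlContext.sql("""
      CREATE EXTERNAL TABLE IF NOT EXISTS events (user_id STRING, event_time TIMESTAMP)
      STORED AS PARQUET
      LOCATION 's3://my-bucket/events/'
    """)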
6 votes, 2 answers

Running Spark job not shown in the UI

I have submitted my Spark job as mentioned here: bin/spark-submit --class DataSet BasicSparkJob-assembly-1.0.jar without mentioning the --master parameter or the spark.master property. Instead, the job gets submitted to my 3-node Spark cluster. But I…
Naresh • 5,073 • 12 • 67 • 124
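If the goal is to see the job in the standalone cluster's UI, the usual fix is to name the master explicitly; a sketch with a placeholder host (7077 is the standalone default port):

    bin/spark-submit \
      --class DataSet \
      --master spark://<master-host>:7077 \
      BasicSparkJob-assembly-1.0.jar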
5 votes, 2 answers

PySpark: How to use a row value from one column to access another column that has the same name as the row value

I have a PySpark df:
+---+---+---+---+---+---+---+---+
| id| a1| b1| c1| d1| e1| f1|ref|
+---+---+---+---+---+---+---+---+
|  0|  1| 23|  4|  8|  9|  5| b1|
|  1|  2| 43|  8| 10| 20| 43| e1|
|  2|  3| 15|  0|  1| 23|  7| b1|
|  3|  4|  2|  6| 11| …
Mia21 • 119 • 2 • 10
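A sketch using one when(...) per candidate column wrapped in coalesce, so the value is taken from whichever column name matches ref; the output column name is an assumption:

    from pyspark.sql import functions as F

    candidate_cols = ["a1", "b1", "c1", "d1", "e1", "f1"]

    # For each row, exactly one when(...) is non-null: the column named by "ref".
    df = df.withColumn(
        "ref_value",
        F.coalesce(*[F.when(F.col("ref") == c, F.col(c)) for c in candidate_cols])
    )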