
I would like to insert some data into my table "test" in Hive through a PySpark script (Python for Spark).

First of all, I created a table "animals" in HUE's graphical interface for Hive with the query:

CREATE TABLE animals( id int, animal String) 

Thus I got a new table. I then wrote this script to add a new row (1, dog) to it:

from pyspark.conf import SparkConf
from pyspark import SparkContext
from pyspark.sql import SQLContext, HiveContext

sc = SparkContext()
hc = HiveContext(sc)
dataToInsert = hc.sql("select 1 as id, 'dog' as animal")
dataToInsert.write.mode("append").insertInto("animals")

After executing it (several times, with "sudo pyspark myscript.py" in the terminal), it seems it doesn't add anything.

Do you know how to fix that, or another way to insert data into a Hive/Impala table through a Python script for Spark?

Thanks in advance!

Olscream
  • _it seems it doesn't add anything_ is substantially different than _PySpark Hive SQL - NULL data inserted_. Also, do you see records in `hc.table("animals").show()`? If yes, then you didn't correctly configure Spark. – Alper t. Turker Jan 25 '18 at 13:47
  • What is `animals` and what is `test`? – Bala Jan 25 '18 at 14:06
  • @user8371915 : You're right, I changed the title to a more meaningful name. Thanks. – Olscream Jan 25 '18 at 14:10
  • @Bala : I just made a mistake when I wrote this post, but my table is actually "animals" here. I edited it. Just consider "animals" as the name of the table I would like to query. – Olscream Jan 25 '18 at 14:12
  • @Olscream, I see you are saying it's not inserting the rows after following the answer below and your comments to it. Does that mean you are getting errors, or is it just not inserting anything without giving an error? – roh Jan 25 '18 at 19:08

2 Answers


It is probably because you have created a table called test but you are inserting into animals, which may not exist.

Try changing animals to test. Make sure you end with sc.stop() and launch the script with spark-submit:

from pyspark.conf import SparkConf
from pyspark import SparkContext
from pyspark.sql import SQLContext, HiveContext

sc = SparkContext()
hc = HiveContext(sc)
dataToInsert = hc.sql("select 1 as id, 'dog' as animal")
dataToInsert.write.mode("append").insertInto("test")
sc.stop()

It just works fine for me. What version are you using?

Bala
  • Thank you, but I have a table named "animals" in my DB. I just made a mistake when I wrote this post, sorry. But apart from that, do you see other potential errors? – Olscream Jan 25 '18 at 14:14
  • I finally tried again. Everything related to my DB structure should be fine (with animals). I added "sc.stop()" like you did and launched the script with "sudo spark-submit myscript.py", but it didn't add anything to my table. My version of Spark is 1.6.0. Do I have to update it? If yes, how do I do it, please? – Olscream Jan 25 '18 at 17:08
  • You don't have to update. The version I used was 1.6.0 (Cloudera quick VM). – Bala Jan 26 '18 at 19:38

Looks like the issue is with insertInto. Try using saveAsTable instead; insertInto was deprecated in Spark 1.4.

Something like dataToInsert.write.mode("append").saveAsTable("SchemaName.animals"); note that the schema name is mandatory.

If the above leaves you with no luck, try writing to the Hive table's HDFS path. I would also suggest repairing the Hive table after running the PySpark job; the SQL command for it is msck repair table <schema.tablename>;, run either from the hiveContext or from Hive.

Why do you need msck repair? Check out my other answer here.

Here is a snapshot of the doc:

saveAsTable(name, format=None, mode=None, partitionBy=None, **options)

Saves the content of the DataFrame as the specified table.

In the case the table already exists, behavior of this function depends on the save mode, specified by the mode function (default to throwing an exception). When mode is Overwrite, the schema of the DataFrame does not need to be the same as that of the existing table.

  • append: Append contents of this DataFrame to existing data.
  • overwrite: Overwrite existing data.
  • error: Throw an exception if data already exists.
  • ignore: Silently ignore this operation if data already exists.

Parameters:
  • name – the table name
  • format – the format used to save
  • mode – one of append, overwrite, error, ignore (default: error)
  • partitionBy – names of partitioning columns
  • options – all other string options

New in version 1.4.

roh