
I have an empty table in Hive, i.e. there are no records in that table.

Using this empty table I have created a DataFrame in PySpark:

df = sqlContext.table("testing.123_test")

I have registered this DataFrame as a temp table:

df.registerTempTable('mytempTable')

date=datetime.now().strftime('%Y-%m-%d %H:%M:%S')

This table has a column called id.

Now I want to query the temp table like below:

min_id = sqlContext.sql("select nvl(min(id),0) as minval from mytempTable").collect()[0].asDict()['minval']

max_id = sqlContext.sql("select nvl(max(id),0) as maxval from mytempTable").collect()[0].asDict()['maxval']
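For reference, the nvl fallback these queries rely on can be sketched in plain Python (a hypothetical helper, not part of the actual script):

```python
def nvl(value, default):
    # Hive's nvl(): return the default when the value is NULL (None here)
    return default if value is None else value

# min over an empty table yields NULL, which nvl turns into 0
ids = []  # no rows
min_id = nvl(min(ids, default=None), 0)
print(min_id)  # 0
```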

Now I want to save date, min_id and max_id into a file in HDFS

I have done like below:

from pyspark.sql import functions as f

(sqlContext.table("myTempTable")
    .select(f.concat_ws(",", f.first(f.lit(date)), f.min("id"), f.max("id")))
    .coalesce(1)
    .write.format("text").mode("append")
    .save("/tmp/fooo"))

Now when I check the file in HDFS, it shows all NULL values.

The file output in HDFS is below.

NULL,NULL,NULL

What I want is

Date,0,0

Here date is the current timestamp

How can I achieve what I want?

  • I don't understand what you're trying to do here. Why read an empty table and then write `Date,0,0` to HDFS. Could you please elaborate a bit more? – philantrovert Jun 10 '17 at 05:24
  • @philantrovert Here I am trying to save the data of some tables into a directory in `HDFS`. I am able to do this if the table has records, but if the table is empty then I am facing the above scenario –  Jun 10 '17 at 12:58
  • It's funny. I know that code from https://stackoverflow.com/a/44315328/3415409 – eliasah Jun 10 '17 at 17:29
  • empty tables shouldn't be persisted. You just need to check count of the dataframe. So I honestly don't see what you are trying to do here – eliasah Jun 10 '17 at 18:35
  • @eliasah What you said is correct, but this table might have data when I run this script the next time. So even if the table is empty, it should write to the `HDFS` file as per my requirement –  Jun 11 '17 at 02:21

1 Answer


This is in Scala, but you should easily be able to replicate it in Python. The function you need here is na.fill; you'll have to replace the Scala Map with a Python dictionary in the code below.

This is what your DF looks like:

scala> nullDF.show
+----+----+----+
|date|   x|   y|
+----+----+----+
|null|null|null|
+----+----+----+

// You have already done this using Python's datetime functions
val format = new java.text.SimpleDateFormat("dd/MM/yyyy HH:mm:ss")
val curr_timestamp = format.format(new java.util.Date())

//Use na fill to replace null values
//Column names as keys in map
//And values are what you want to replace NULL with

val df = nullDF.na.fill(scala.collection.immutable.Map(
         "date" -> curr_timestamp ,
         "x" -> "0" ,
         "y" -> "0" ) )

This should give you

+-------------------+---+---+
|               date|  x|  y|
+-------------------+---+---+
|10/06/2017 12:10:20|  0|  0|
+-------------------+---+---+
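Translated to PySpark, the fill would look roughly like `nullDF.na.fill({"date": curr_timestamp, "x": "0", "y": "0"})` (assuming the same column names as above). Its per-row semantics can be sketched in plain Python:

```python
def na_fill(row, fill_values):
    # Mimics DataFrame.na.fill(dict): for each column, replace a
    # None value with the fill value given for that column (if any)
    return {col: (fill_values[col] if val is None and col in fill_values else val)
            for col, val in row.items()}

row = {"date": None, "x": None, "y": None}
filled = na_fill(row, {"date": "10/06/2017 12:10:20", "x": "0", "y": "0"})
print(filled)  # {'date': '10/06/2017 12:10:20', 'x': '0', 'y': '0'}
```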
philantrovert