Loading a dataframe with check in pyspark is giving me empty dataframe

Question

I am trying to load data into a dataframe using PySpark. The files are in Parquet format. I am using the following code:

from pyspark.conf import SparkConf
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType,StructField,IntegerType,StringType,BooleanType,DateType,TimestampType,LongType,FloatType,DoubleType,ArrayType,ShortType
from pyspark.sql import HiveContext
from pyspark.sql.functions import lit
from pyspark import SparkContext
from pyspark import SQLContext
from datetime import date, timedelta as td
from datetime import datetime

import pandas as pd
daterange = pd.date_range('2019-12-01','2019-12-31')

df = sqlContext.createDataFrame(sc.emptyRDD())

for process_date in daterange:
    try:
        name = 's3://location/process_date={}'.format(process_date.strftime("%Y-%m-%d"))+'/'
        print(name)
        x = spark.read.parquet(name)
        x = x.withColumn('process_date',lit(process_date.strftime("%Y-%m-%d")))
        x.show()
        df = df.union(x)
    except:
        print("File doesnt exist for"+str(process_date.strftime("%Y-%m-%d")))

But when I run this code, df ends up as an empty dataset, and even though data exists for some of the dates, the exception message is printed for every date in the range. Can anyone tell me what I am doing wrong?

[How to create an empty DataFrame with a specified schema?](https://stackoverflow.com/q/31477598/10938362) – user10938362, Jan 17 '20 at 13:47
What is the value of the name variable? Does it match the S3 folder name? – Salim, Jan 17 '20 at 13:48
Yeah, I have checked the path. The path matches the name variable. – Bitanshu Das, Jan 18 '20 at 06:12

score 1 · Accepted Answer · answered Jan 18 '20 at 10:52

I think the problem is the union combined with an overly broad except clause.
union only works if the dataframes being unioned have the same schema.
Hence emptyDF.union(nonEmpty) raises an error, which the bare except clause then catches.
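
A minimal sketch of one way around both problems, under some assumptions: the helper names partition_paths and load_date_range are hypothetical, the s3://location layout is taken from the question, and a missing path is assumed to surface as AnalysisException. Instead of seeding the union with an empty dataframe, it collects only the frames that were actually read, and it catches only the read error so that a schema mismatch in union is not silently swallowed.

```python
from datetime import date, timedelta
from functools import reduce


def partition_paths(base, start, end):
    """Return (day, path) pairs for every day from start to end, inclusive."""
    days = (end - start).days + 1
    pairs = []
    for offset in range(days):
        day = (start + timedelta(days=offset)).isoformat()
        pairs.append((day, "{}/process_date={}/".format(base.rstrip("/"), day)))
    return pairs


def load_date_range(spark, base, start, end):
    """Union the partitions that exist; return None if none could be read."""
    # Imported here so partition_paths can be used without Spark installed.
    from pyspark.sql.functions import lit
    from pyspark.sql.utils import AnalysisException

    frames = []
    for day, path in partition_paths(base, start, end):
        try:
            frame = spark.read.parquet(path)
        except AnalysisException:
            # Assumed to be raised for a missing path; other errors propagate.
            print("No data for " + day)
            continue
        frames.append(frame.withColumn("process_date", lit(day)))

    # No empty seed dataframe: union only the frames that were actually read.
    if not frames:
        return None
    return reduce(lambda left, right: left.union(right), frames)
```

With an existing SparkSession named spark, this could be called as load_date_range(spark, "s3://location", date(2019, 12, 1), date(2019, 12, 31)). Because withColumn and union run outside the try block, any schema problem shows up as a real error instead of a "file doesn't exist" message.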

Paul


The schema is consistent, so union shouldn't cause an issue. The exception handler is there to check whether a file is present at the location. My use case is to load data into a df for a range of dates while checking whether the data is present or not. – Bitanshu Das Jan 19 '20 at 11:29
If union with a blank df causes this issue, I can create a struct schema and use it to create a blank df with that schema. – Bitanshu Das Jan 19 '20 at 11:30
Yes it looks like union with blank is the issue. Have you tried creating the empty df with the correct schema? – Paul Jan 19 '20 at 18:14
Thanks for the suggestion. It works once I provide a schema – Bitanshu Das Jan 20 '20 at 07:09