I am trying to read multiple parquet files from multiple partitions via PySpark and concatenate them into one big DataFrame. The files look like this:
hdfs dfs -ls /data/customers/odysseyconsultants/logs_ch_blade_fwvpn
Found 180 items
drwxrwxrwx - impala impala 0 2018-03-01 10:31 /data/customers/odysseyconsultants/logs_ch_blade_fwvpn/_impala_insert_staging
drwxr-xr-x - impala impala 0 2017-08-23 17:55 /data/customers/odysseyconsultants/logs_ch_blade_fwvpn/cdateint=20170822
drwxr-xr-x - impala impala 0 2017-08-24 05:57 /data/customers/odysseyconsultants/logs_ch_blade_fwvpn/cdateint=20170823
drwxr-xr-x - impala impala 0 2017-08-25 06:00 /data/customers/odysseyconsultants/logs_ch_blade_fwvpn/cdateint=20170824
drwxr-xr-x - impala impala 0 2017-08-26 06:04 /data/customers/odysseyconsultants/logs_ch_blade_fwvpn/cdateint=20170825
Each partition holds one or more parquet files, e.g.
hdfs dfs -ls /data/customers/odysseyconsultants/logs_ch_blade_fwvpn/cdateint=20170822
Found 1 items
-rw-r--r-- 2 impala impala 72252308 2017-08-23 17:55 /data/customers/odysseyconsultants/logs_ch_blade_fwvpn/cdateint=20170822/5b4bb1c5214fdffd-cc8dbcf600000008_1393229110_data.0.parq
What I'm trying to create is a generic function that takes from and to arguments, then loads and concatenates all the parquet files in that date range into one big DataFrame.
I can build the list of paths to read:
def read_files(table, from1, to):
    # one partition path per day in [from1, to]
    base = '/data/customers/odysseyconsultants/' + table
    return [base + '/cdateint=' + str(i) for i in range(from1, to + 1)]
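For example, a three-day range yields one partition path per day:

read_files('logs_ch_blade_fwvpn', 20170506, 20170508)
['/data/customers/odysseyconsultants/logs_ch_blade_fwvpn/cdateint=20170506',
 '/data/customers/odysseyconsultants/logs_ch_blade_fwvpn/cdateint=20170507',
 '/data/customers/odysseyconsultants/logs_ch_blade_fwvpn/cdateint=20170508']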
If I attempt to read the files one by one, as follows, I get an exception:
for i in read_files('logs_ch_blade_fwvpn', 20170506, 20170510):
    sqlContext.read.parquet(i).show()
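Even where the individual reads succeed, this loop only shows each partition on its own; to end up with one DataFrame I would still have to union the parts, along these lines (a minimal sketch; DataFrame.union is the Spark 2.x name, older versions call it unionAll):

from functools import reduce

parts = [sqlContext.read.parquet(p)
         for p in read_files('logs_ch_blade_fwvpn', 20170506, 20170510)]
big_df = reduce(lambda a, b: a.union(b), parts)  # concatenate all partitions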
If I instead try to read them all at once,
x = read_files('logs_cs_blade_fwvpn', 20180109, 20180110)
d1 = sqlContext.read.parquet(*x)
I get this error:
pyspark.sql.utils.AnalysisException: u'Path does not exist: hdfs://nameservice1/data/customers/odysseyconsultants/logs_cs_blade_fwvpn/cdateint=20180109;'
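The generated list contains a path for every integer in the range, so a single missing partition directory (or a mistyped table name) makes the whole read fail. One workaround might be to read the table root and filter on the partition column instead, so Spark prunes down to the partitions that actually exist. A sketch, where read_range is a hypothetical helper and I assume the partition column is exposed as cdateint and all partitions share a compatible schema:

from pyspark.sql.functions import col

def read_range(table, from1, to):
    # read the partitioned root; Spark discovers cdateint as a partition column
    base = '/data/customers/odysseyconsultants/' + table
    return (sqlContext.read.parquet(base)
            .where((col('cdateint') >= from1) & (col('cdateint') <= to)))

d1 = read_range('logs_ch_blade_fwvpn', 20180109, 20180110)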