
I have multiple CSV files in an HDFS directory:

/project/project_csv/file1.csv
/project/project_csv/file2.csv
/project/project_csv/file3.csv

Now, in my PySpark program I want to iterate over the files in that path and, for each file, read the data into a dataframe and load it into a specific table.

For example, with the first file, file1.csv, read it into df and save it to table1:

df = spark.read.csv('/project/project_csv/file1.csv')
df.write.mode('overwrite').format('hive').saveAsTable('data_base.table_name1')

With the second file, file2.csv, read it into df and save it to table2:

df = spark.read.csv('/project/project_csv/file2.csv')
df.write.mode('overwrite').format('hive').saveAsTable('data_base.table_name2')

In the same way, I want to iterate over multiple files and save the data into different tables.
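
In other words, I am after something like the loop sketched below, where the hardcoded file-to-table mapping is only illustrative and should really be built from whatever is in the directory:

# Illustrative sketch only: the mapping is hardcoded here, but it should
# be derived from the files actually present in the HDFS directory.
mapping = {
    '/project/project_csv/file1.csv': 'data_base.table_name1',
    '/project/project_csv/file2.csv': 'data_base.table_name2',
    '/project/project_csv/file3.csv': 'data_base.table_name3',
}

for path, table in mapping.items():
    df = spark.read.csv(path, header=True, inferSchema=True)
    df.write.mode('overwrite').format('hive').saveAsTable(table)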


2 Answers


You can use glob() to iterate through all the files in a specific folder and use a condition to perform a file-specific operation, as below:

* to loop through all the files/folders in the path
*.csv to consider only the CSV files in that folder



import glob

files = glob.glob(r"C:\Users\path\*.csv")
for i in files:
    if i.endswith("file1.csv"):
        df = spark.read.csv(i)
        df.write.mode('overwrite').format('hive').saveAsTable('data_base.table_name1')
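
If every file needs to go to its own table, one way to avoid a separate if branch per file is to derive the table name from the file name. This is just a sketch: it assumes the files are reachable on the local filesystem (as in the glob example above) and that data_base.<file name> is an acceptable naming scheme, so adjust both to your setup.

import glob
import os

files = glob.glob(r"C:\Users\path\*.csv")
for i in files:
    # Derive the table name from the file name, e.g. file1.csv -> data_base.file1.
    # Adjust this rule to match your actual table names.
    table_name = 'data_base.' + os.path.splitext(os.path.basename(i))[0]
    df = spark.read.csv(i, header=True, inferSchema=True)
    df.write.mode('overwrite').format('hive').saveAsTable(table_name)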

I think what you want to ask is how to list files in an HDFS directory in Python. You can use the HdfsCLI package:

from hdfs import Config
client = Config().get_client('dev')
files = client.list('/path')
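
From there, you can feed each listed CSV back into Spark and write it to its own table. A rough sketch (the directory, database name, and file-to-table naming rule below are assumptions, so adjust them to your setup):

from hdfs import Config

client = Config().get_client('dev')

for name in client.list('/project/project_csv'):
    if name.endswith('.csv'):
        # client.list() returns bare file names, so rebuild the full HDFS path
        # and derive a table name from the file name, e.g. file1.csv -> data_base.file1.
        path = '/project/project_csv/' + name
        table_name = 'data_base.' + name[:-len('.csv')]
        df = spark.read.csv(path, header=True, inferSchema=True)
        df.write.mode('overwrite').format('hive').saveAsTable(table_name)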