
I have multiple CSV files in an HDFS directory:

/project/project_csv/file1.csv
/project/project_csv/file2.csv
/project/project_csv/file3.csv

Now, in my PySpark program I want to iterate over the files in that path and, for each file, read the data into a dataframe and load it into a specific table.

For example, with the first file, file1.csv, read it into df and save it to table1:

df = spark.read.csv('/project/project_csv/file1.csv')
df.write.mode('overwrite').format('hive').saveAsTable('data_base.table_name1')

With the second file, file2.csv, read it into df and save it to table2:

df = spark.read.csv('/project/project_csv/file2.csv')
df.write.mode('overwrite').format('hive').saveAsTable('data_base.table_name2')

In the same way, I want to iterate over multiple files and save the data into different tables.
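
In other words, I am after something like the loop sketched below, where the hardcoded file-to-table mapping is only illustrative and should really be built from whatever is in the directory:

# Illustrative sketch only: the mapping is hardcoded here, but it should
# be derived from the files actually present in the HDFS directory.
mapping = {
    '/project/project_csv/file1.csv': 'data_base.table_name1',
    '/project/project_csv/file2.csv': 'data_base.table_name2',
    '/project/project_csv/file3.csv': 'data_base.table_name3',
}

for path, table in mapping.items():
    df = spark.read.csv(path, header=True, inferSchema=True)
    df.write.mode('overwrite').format('hive').saveAsTable(table)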


2 Answers


You can use glob() to iterate through all the files in a specific folder and use a condition to perform a file-specific operation, as below:

* to loop through all the files/folders in the path
*.csv to consider only the CSV files in that folder



import glob

files = glob.glob(r"C:\Users\path\*.csv")
for i in files:
    if i.endswith("file1.csv"):
        df = spark.read.csv(i)
        df.write.mode('overwrite').format('hive').saveAsTable('data_base.table_name1')
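
If every file needs to go to its own table, one way to avoid a separate if branch per file is to derive the table name from the file name. This is just a sketch: it assumes the files are reachable on the local filesystem (as in the glob example above) and that data_base.<file name> is an acceptable naming scheme, so adjust both to your setup.

import glob
import os

files = glob.glob(r"C:\Users\path\*.csv")
for i in files:
    # Derive the table name from the file name, e.g. file1.csv -> data_base.file1.
    # Adjust this rule to match your actual table names.
    table_name = 'data_base.' + os.path.splitext(os.path.basename(i))[0]
    df = spark.read.csv(i, header=True, inferSchema=True)
    df.write.mode('overwrite').format('hive').saveAsTable(table_name)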

I think what you want to ask is how to list files in an HDFS directory in Python. You can use the HdfsCLI package:

from hdfs import Config
client = Config().get_client('dev')
files = client.list('/path')
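
From there, you can feed each listed CSV back into Spark and write it to its own table. A rough sketch (the directory, database name, and file-to-table naming rule below are assumptions, so adjust them to your setup):

from hdfs import Config

client = Config().get_client('dev')

for name in client.list('/project/project_csv'):
    if name.endswith('.csv'):
        # client.list() returns bare file names, so rebuild the full HDFS path
        # and derive a table name from the file name, e.g. file1.csv -> data_base.file1.
        path = '/project/project_csv/' + name
        table_name = 'data_base.' + name[:-len('.csv')]
        df = spark.read.csv(path, header=True, inferSchema=True)
        df.write.mode('overwrite').format('hive').saveAsTable(table_name)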