How can I read any CSV files in a folder and merge into the one CSV file

Question

I have a folder labeled 'input' with multiple CSV files in it. They all have the same columns names but the data is different in each CSV file.

How can I use Spark and Java to go to the folder labeled 'input', read all the CSV files in that folder, and merge those CSV files into one file.

The files in the folder may change, e.g. might have 4 CSV files and another day have 6 and so on so forth.

Dataset<Row> df = (
        spark.read()
                .format("com.databricks.spark.csv")
                .option("header", "true")
                .load("/Users/input/*.csv")
);

However, I don't get an output, Spark just shuts down.

I don't want to list all the CSV files in the folder, I want the code to take any CSV files present in that folder and merge. Is this possible?

From there I can use that one CSV file to convert into a data frame.

@philantrovert I did see that and tried a few but I get no output. Can you see anything wrong? — JoeyOC, Oct 06 '21 at 12:50

jgp · Accepted Answer · 2021-10-06T18:41:51.470

In your example, you may have used an older version of the data source. The new datasource ("csv") may work better:

Dataset<Row> df = spark.read()
                    .format("csv")
                    .option("header", true)
                    .load("/Users/input/*.csv");
df.show();

should work.

You can find a complete example there: https://github.com/jgperrin/net.jgp.books.spark.ch01 and with multiple files there: https://github.com/jgperrin/net.jgp.books.spark.ch15/blob/master/src/main/java/net/jgp/books/spark/ch15/lab300_nyc_school_stats/NewYorkSchoolStatisticsApp.java

How can I read any CSV files in a folder and merge into the one CSV file

1 Answers1