
We are trying to create a dataset by reading a folder that contains Excel, txt, and csv files, using the following:

spark.read.option("header", "true")
          .option("delimiter", delimiter)
          .option("ignoreTrailingWhiteSpace", true)
          .option("ignoreLeadingWhiteSpace", true)
          .csv(directoryPath + "\\" + feedFolder + "\\*");

The CSV API reads the Excel files as well, producing garbage data like the sample below. How can we exclude .xlsx files when reading with Spark's CSV API? Kindly let us know.

|?�9L�ҙ�sbgٮ |�l!��USh9i�b�r:"y_dl��D��� |-N��R"4�2�G�%��Z�4�˝y�7\të��ɂ���

Daksh
  • 148
  • 11
  • 1
    Does this answer your question? [How to construct Dataframe from a Excel (xls,xlsx) file in Scala Spark?](https://stackoverflow.com/questions/44196741/how-to-construct-dataframe-from-a-excel-xls-xlsx-file-in-scala-spark) – Koedlt Jun 07 '23 at 06:28
  • Where are you trying to run this code? Is it on an emr cluster, databricks or locally? And is it on scala or pyspark? – Rohit Anil Jun 07 '23 at 06:29
  • The obvious answer would appear to be: change the file filter to only include `.csv` i.e. `csv(directoryPath + "\\" + feedFolder + "\\*.csv");` – Nick.Mc Jun 07 '23 at 06:34
  • The folder can contain both .txt and .csv files, or only one of them. If we pass both globs to csv(paths) and there is no .txt file in the folder, the read fails. – Daksh Jun 07 '23 at 07:32

2 Answers

1

You can use this syntax to pass several paths at the same time, even if you're working with a Spark version < 3.0.0:

spark.read.option("header", "true")
          .option("delimiter", delimiter)
          .option("ignoreTrailingWhiteSpace", true)
          .option("ignoreLeadingWhiteSpace", true)
          .csv(directoryPath + "\\" + feedFolder + "\\*.csv", directoryPath + "\\" + feedFolder + "\\*.txt");
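
As the comment on the question notes, passing several globs fails if one of them matches no files. A hedged sketch of one way around that: build the path list from the extensions actually present before calling `csv`. The helper below (`existingGlobs` is a hypothetical name, and it uses plain `java.nio`, so it only works for a local folder; on HDFS or S3 you would use Hadoop's `FileSystem.globStatus` instead):

```scala
import java.nio.file.{Files, Paths}
import scala.jdk.CollectionConverters._

// Keep only the extension globs that match at least one file in the folder,
// so the reader is never handed a glob with zero matches.
def existingGlobs(folder: String, exts: Seq[String]): Seq[String] = {
  val files = Files.list(Paths.get(folder)).iterator().asScala.toList
  exts.filter(ext => files.exists(_.toString.endsWith(ext)))
      .map(ext => folder + java.io.File.separator + "*" + ext)
}

// Usage sketch (assumes a live SparkSession named `spark` and the
// directoryPath/feedFolder variables from the question):
// val paths = existingGlobs(directoryPath + "\\" + feedFolder, Seq(".csv", ".txt"))
// val df = spark.read.option("header", "true").csv(paths: _*)
```

This way .xlsx files are never matched, and a missing .txt (or .csv) simply drops out of the path list instead of failing the read.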
shalnarkftw
  • 402
  • 2
  • 8
0

You can include a csv filter:

csv(directoryPath + "\\" + feedFolder + "\\*.csv")

You can also try changing the mode to DROPMALFORMED

.option("mode","DROPMALFORMED")

DROPMALFORMED: ignores the whole corrupted records. This mode is unsupported in the CSV built-in functions.

This way, if you read corrupted data (i.e. rows coming from another file format), there is a good chance the row will be dropped.
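
A hedged sketch of the full read chain with that option (assumes a live `SparkSession` named `spark`, the `directoryPath`/`feedFolder` variables from the question, and an illustrative two-column schema; DROPMALFORMED is only meaningful when rows can be judged against an explicit schema, and Excel bytes that happen to parse as delimited text may still slip through):

```scala
val df = spark.read
  .option("header", "true")
  .option("delimiter", ",")
  .option("mode", "DROPMALFORMED")  // silently drop rows that don't fit the schema
  .schema("id INT, name STRING")    // hypothetical schema; replace with the feed's real columns
  .csv(directoryPath + "\\" + feedFolder + "\\*")
```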

maxime G
  • 1,660
  • 1
  • 10
  • 27
  • The folder can contain both .txt and .csv files, or only one of them. If we pass both globs to csv(paths) and there is no .txt file in the folder, the read fails. – Daksh Jun 07 '23 at 07:33
  • Why would it fail? That doesn't make sense. Is your txt file in CSV format as well? – maxime G Jun 07 '23 at 08:04