
We are trying to create a dataset by reading a folder that contains Excel, txt, and csv files, using the following:

spark.read.option("header", "true")
          .option("delimiter", delimiter)
          .option("ignoreTrailingWhiteSpace", true)
          .option("ignoreLeadingWhiteSpace", true)
          .csv(directoryPath + "\\" + feedFolder + "\\*");

The CSV API reads the Excel files as well, producing garbage data like the sample below. How can we exclude .xlsx files when reading with Spark's CSV API? Kindly let us know.

|?�9L�ҙ�sbgٮ |�l!��USh9i�b�r:"y_dl��D��� |-N��R"4�2�G�%��Z�4�˝y�7\të��ɂ���

Daksh
  • 148
  • 11
  • 1
    Does this answer your question? [How to construct Dataframe from a Excel (xls,xlsx) file in Scala Spark?](https://stackoverflow.com/questions/44196741/how-to-construct-dataframe-from-a-excel-xls-xlsx-file-in-scala-spark) – Koedlt Jun 07 '23 at 06:28
  • Where are you trying to run this code? Is it on an emr cluster, databricks or locally? And is it on scala or pyspark? – Rohit Anil Jun 07 '23 at 06:29
  • The obvious answer would appear to be: change the file filter to only include `.csv` i.e. `csv(directoryPath + "\\" + feedFolder + "\\*.csv");` – Nick.Mc Jun 07 '23 at 06:34
  • The folder can contain both .txt and .csv files, or only one of them. If we pass both globs to csv(paths) and there is no .txt file in the folder, the read fails. – Daksh Jun 07 '23 at 07:32

2 Answers

1

You can use this syntax to pass several paths at the same time, even if you're working with a Spark version < 3.0.0:

spark.read.option("header", "true")
          .option("delimiter", delimiter)
          .option("ignoreTrailingWhiteSpace", true)
          .option("ignoreLeadingWhiteSpace", true)
          .csv(directoryPath + "\\" + feedFolder + "\\*.csv", directoryPath + "\\" + feedFolder + "\\*.txt");
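
As the comment on the question notes, passing several globs fails if one of them matches no files. A hedged sketch of one way around that: build the path list from the extensions actually present before calling `csv`. The helper below (`existingGlobs` is a hypothetical name, and it uses plain `java.nio`, so it only works for a local folder; on HDFS or S3 you would use Hadoop's `FileSystem.globStatus` instead):

```scala
import java.nio.file.{Files, Paths}
import scala.jdk.CollectionConverters._

// Keep only the extension globs that match at least one file in the folder,
// so the reader is never handed a glob with zero matches.
def existingGlobs(folder: String, exts: Seq[String]): Seq[String] = {
  val files = Files.list(Paths.get(folder)).iterator().asScala.toList
  exts.filter(ext => files.exists(_.toString.endsWith(ext)))
      .map(ext => folder + java.io.File.separator + "*" + ext)
}

// Usage sketch (assumes a live SparkSession named `spark` and the
// directoryPath/feedFolder variables from the question):
// val paths = existingGlobs(directoryPath + "\\" + feedFolder, Seq(".csv", ".txt"))
// val df = spark.read.option("header", "true").csv(paths: _*)
```

This way .xlsx files are never matched, and a missing .txt (or .csv) simply drops out of the path list instead of failing the read.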
shalnarkftw
  • 402
  • 2
  • 8
0

You can include a csv filter:

csv(directoryPath + "\\" + feedFolder + "\\*.csv")

You can also try changing the mode to DROPMALFORMED

.option("mode","DROPMALFORMED")

DROPMALFORMED: ignores the whole corrupted records. This mode is unsupported in the CSV built-in functions.

This way, if you read corrupted data (i.e. rows coming from another file format), there is a good chance the row will be dropped.
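
A hedged sketch of the full read chain with that option (assumes a live `SparkSession` named `spark`, the `directoryPath`/`feedFolder` variables from the question, and an illustrative two-column schema; DROPMALFORMED is only meaningful when rows can be judged against an explicit schema, and Excel bytes that happen to parse as delimited text may still slip through):

```scala
val df = spark.read
  .option("header", "true")
  .option("delimiter", ",")
  .option("mode", "DROPMALFORMED")  // silently drop rows that don't fit the schema
  .schema("id INT, name STRING")    // hypothetical schema; replace with the feed's real columns
  .csv(directoryPath + "\\" + feedFolder + "\\*")
```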

maxime G
  • 1,660
  • 1
  • 10
  • 27
  • The folder can contain both .txt and .csv files, or only one of them. If we pass both globs to csv(paths) and there is no .txt file in the folder, the read fails. – Daksh Jun 07 '23 at 07:33
  • Why would it fail? That doesn't make sense. Is your txt file in CSV format as well? – maxime G Jun 07 '23 at 08:04