PySpark read in multiple files CSV or TSV

Asked Sep 09 '22 at 08:57

Active Sep 09 '22 at 08:57

Viewed 139 times

I'm trying to load all the files in a folder. They have the same schema, but sometimes a different delimiter (i.e. usually comma-separated, but occasionally tab-separated).

Is there a way to pass in two delimiters? To be specific, I don't want a two-character delimiter like "||"; I want multiple single-character delimiters to be treated the same way.

I'm letting it infer the schema. Commas work, but tabbed rows just end up in the first column.


WellyGus

Ideally those files would be split into different folders. Do they possibly have different file extensions (.csv and .tsv)? In that case you can use `spark.read.option("sep", "\t").csv("folder/*.tsv")` for the tab-separated files and `spark.read.csv("folder/*.csv")` for the rest. – walking Sep 09 '22 at 12:12
Ideally! But no. Sadly I don't have control of the source data. Or the people that submit it. – WellyGus Sep 13 '22 at 00:32
Maybe this helps: [Detect Delimiters](https://stackoverflow.com/questions/65909857/may-i-use-either-tab-or-comma-as-delimiter-when-reading-from-pandas-csv) – Sharma Jan 10 '23 at 13:06
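
Since the file extensions can't be relied on, one possible workaround is to guess each file's delimiter from its header line, read the comma files and the tab files separately, and union the results. The sketch below is only that, a sketch. The helper names (`sniff_delimiter`, `group_by_delimiter`, `read_mixed`) are made up for illustration, and it assumes that every file has a header row, that the folder is non-empty and sits on a filesystem the driver can list with `pathlib`, and that the delimiter that appears most often in the header is the right one.

```python
from collections import defaultdict
from functools import reduce
from pathlib import Path


def sniff_delimiter(path, candidates=(",", "\t")):
    """Guess the delimiter by counting candidates in the first line.

    Heuristic only: on a tie (e.g. a single-column file) the first
    candidate wins.
    """
    with open(path, encoding="utf-8") as handle:
        header = handle.readline()
    return max(candidates, key=header.count)


def group_by_delimiter(folder):
    """Map each guessed delimiter to the list of file paths that use it."""
    groups = defaultdict(list)
    for path in sorted(Path(folder).iterdir()):
        if path.is_file():
            groups[sniff_delimiter(path)].append(str(path))
    return dict(groups)


def read_mixed(spark, folder):
    """Read each delimiter group with its own `sep`, then union by name.

    `spark` is an existing SparkSession. Assumes a header row in every
    file and at least one file in the folder.
    """
    frames = [
        spark.read.csv(paths, sep=sep, header=True, inferSchema=True)
        for sep, paths in group_by_delimiter(folder).items()
    ]
    return reduce(lambda left, right: left.unionByName(right), frames)
```

With a running session this would be called as `df = read_mixed(spark, "folder")`. Because the schema is inferred separately for each group, the inferred column types may not match between the two reads; passing an explicit `schema=` instead of `inferSchema=True` would avoid that.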

