
I'm trying to load all the files in a folder. They have the same schema, but sometimes use a different delimiter (i.e. usually comma-separated, but occasionally tab-separated).

Is there a way to pass in two delimiters? To be specific, I don't want a two-character delimiter such as "||"; I want to treat multiple single-character delimiters the same way.

I'm letting it infer the schema. Commas work, but tabbed rows just end up in the first column.

  • Ideally those files would be split into different folders. Do they possibly have different file extensions (.csv and .tsv)? In that case you can use `spark.read.option("sep", "\t").csv("folder/*.tsv")` for the tab-separated files and `spark.read.csv("folder/*.csv")` for the rest – walking Sep 09 '22 at 12:12
  • Ideally! But no. Sadly I don't have control of the source data. Or the people that submit it. – WellyGus Sep 13 '22 at 00:32
  • Maybe this helps: [Detect Delimiters](https://stackoverflow.com/questions/65909857/may-i-use-either-tab-or-comma-as-delimiter-when-reading-from-pandas-csv) – Sharma Jan 10 '23 at 13:06
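
Since the files can't be separated by folder or extension, one possible workaround is to pick the separator per file before handing it to Spark. The sketch below is only an illustration under several assumptions: each file uses a single delimiter throughout (comma or tab), every file has a header row, the files sit on a local filesystem that `glob` can see, and `spark` is an existing `SparkSession`. The helper names `detect_delimiter` and `read_mixed_folder` are hypothetical, not part of Spark.

```python
import glob
from functools import reduce


def detect_delimiter(header_line, candidates=(",", "\t")):
    """Guess the delimiter by counting each candidate in the header line.

    Ties (including a line with no candidate at all) fall back to the
    first candidate, i.e. the comma.
    """
    return max(candidates, key=header_line.count)


def read_mixed_folder(spark, pattern):
    """Read every file matching `pattern`, choosing the separator per file.

    `spark` is assumed to be an existing SparkSession; `pattern` is a local
    glob such as "folder/*" (assumption: files are on the local filesystem).
    """
    frames = []
    for path in sorted(glob.glob(pattern)):
        # Peek at the first line only to decide which separator to use.
        with open(path, encoding="utf-8") as fh:
            sep = detect_delimiter(fh.readline())
        frames.append(
            spark.read.option("header", True)  # assumption: files have headers
            .option("inferSchema", True)
            .option("sep", sep)
            .csv(path)
        )
    if not frames:
        raise ValueError(f"no files matched {pattern!r}")
    # All files share one schema, so combine them by column name.
    return reduce(lambda left, right: left.unionByName(right), frames)
```

Usage would look like `df = read_mixed_folder(spark, "folder/*")`. Because the schema is inferred file by file, a column could in principle be inferred with different types in different files; passing an explicit schema instead of `inferSchema` would avoid that. If delimiters are mixed within a single file, this approach will not help.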

0 Answers