I'm using Spark to read a CSV file. The file has two columns, TradeDate and SettleDate, that hold dates in the format yyyyMMdd.
I'm reading the CSV file like this:
    public static DataFrame ReadFile(string path, FileConfiguration fileconfig, SparkSession spark)
    {
        bool hasHeader = fileconfig.FileLoaderFileContainsHeader != 0 || fileconfig.FileLoaderNumberOfLinesToSkip != 0;
        return spark
            .Read()
            .Option("delimiter", fileconfig.FileLoaderColumnSeparator)
            .Option("header", hasHeader)
            .Option("inferSchema", true)
            .Option("dateFormat", "yyyyMMdd")
            .Csv(path);
    }
I also tried:
    public static DataFrame ReadFile(string path, FileConfiguration fileconfig, SparkSession spark)
    {
        bool hasHeader = fileconfig.FileLoaderFileContainsHeader != 0 || fileconfig.FileLoaderNumberOfLinesToSkip != 0;
        return spark
            .Read()
            .Option("delimiter", fileconfig.FileLoaderColumnSeparator)
            .Option("header", hasHeader)
            .Option("inferSchema", true)
            .Option("TimeStampFormat", "yyyyMMdd")
            .Csv(path);
    }
The problem is that when I call DataFrame.PrintSchema(), these columns come back as integer instead of date:
    DataFrame DataframeSource = FileService.ReadFile(AppConfiguration.PathSource, fileConfigurationSource, spark);
    DataframeSource.PrintSchema();
I cannot convert the columns to dates by hand, because I use this script with multiple CSV files whose column names differ even though the date format is the same. For example, in this CSV file the column is named TradeDate, but in another it is FixDate, so the conversion has to happen at import time.