
I'm using Spark to read a CSV file. The file has two columns, TradeDate and SettleDate, that hold dates in the format yyyyMMdd, as you can see in the screenshot below:

[Screenshot of the CSV file]

I'm reading the CSV file like this:

    public static DataFrame ReadFile(string path, FileConfiguration fileconfig, SparkSession spark)
    {
        bool hasHeader = fileconfig.FileLoaderFileContainsHeader != 0 || fileconfig.FileLoaderNumberOfLinesToSkip != 0;
        return spark
            .Read()
            .Option("delimiter", fileconfig.FileLoaderColumnSeparator)
            .Option("header", hasHeader)
            .Option("inferSchema", true)
            .Option("dateFormat", "yyyyMMdd")
            .Csv(path);
    }

I also tried:

    public static DataFrame ReadFile(string path, FileConfiguration fileconfig, SparkSession spark)
    {
        bool hasHeader = fileconfig.FileLoaderFileContainsHeader != 0 || fileconfig.FileLoaderNumberOfLinesToSkip != 0;
        return spark
            .Read()
            .Option("delimiter", fileconfig.FileLoaderColumnSeparator)
            .Option("header", hasHeader)
            .Option("inferSchema", true)
            .Option("TimeStampFormat", "yyyyMMdd")
            .Csv(path);
    }

But the problem is that when I call DataFrame.PrintSchema(), these columns come back as integer:

    DataFrame DataframeSource = FileService.ReadFile(AppConfiguration.PathSource, fileConfigurationSource, spark);

    DataframeSource.PrintSchema();

I cannot convert the columns to dates by hand, because I use this script with multiple CSV files whose column names differ even though the date format is the same. For example, in this CSV file the column is named TradeDate, but in another it is FixDate, so I have to do the conversion at import time.


Pugnatore
  • Does this answer your question? [How to force inferSchema for CSV to consider integers as dates (with "dateFormat" option)?](https://stackoverflow.com/questions/46529404/how-to-force-inferschema-for-csv-to-consider-integers-as-dates-with-dateformat) – 10465355 Mar 17 '20 at 12:51
  • `inferSchema` is brittle. I always define the schema explicitly unless the number of columns is really high. Explicit schema also improves reading times for large CSV datasets as the files are only read once. – Hristo Iliev Mar 17 '20 at 14:00
  • Yes, but the problem is that I have a lot of columns in the CSV file, so it doesn't seem very viable to define the schema explicitly. – Pugnatore Mar 17 '20 at 17:07
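
One possible middle ground between `inferSchema` and a fully explicit schema is to keep the inferred read and convert the date columns afterwards by naming convention. The sketch below is only an illustration: `DateColumnHelper` and `ParseDateColumns` are hypothetical names that do not appear in the question, and it assumes that every date column has a name ending in `Date` (as TradeDate, SettleDate and FixDate do) and holds yyyyMMdd values. Columns that match the suffix but contain something else would come back as null.

```csharp
using System;
using Microsoft.Spark.Sql;
using static Microsoft.Spark.Sql.Functions;

public static class DateColumnHelper
{
    // Hypothetical helper, not part of the original code.
    // Assumption: date columns can be recognized by a name suffix ("Date")
    // and all of them share the same format ("yyyyMMdd").
    public static DataFrame ParseDateColumns(
        DataFrame df,
        string suffix = "Date",
        string format = "yyyyMMdd")
    {
        foreach (string name in df.Columns())
        {
            if (name.EndsWith(suffix, StringComparison.OrdinalIgnoreCase))
            {
                // inferSchema reads values such as 20200317 as integers,
                // so cast to string first and then parse with the date format.
                df = df.WithColumn(name, ToDate(Col(name).Cast("string"), format));
            }
        }

        return df;
    }
}
```

It could then wrap the existing call, for example `DateColumnHelper.ParseDateColumns(FileService.ReadFile(AppConfiguration.PathSource, fileConfigurationSource, spark))`, so the per-file column names never need to be listed by hand.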
