Spark's CSV readers are not as flexible as `pandas.read_csv` and do not seem able to handle, among other things, dates that appear in several different formats. Is there a good way to convert pandas DataFrames to Spark DataFrames in an ETL map step? Spark's `createDataFrame` does not always appear to work, perhaps because the type mapping between pandas and Spark is not exhaustive. Paratext looks promising, but it is likely new and not yet heavily used.

For example here: Get CSV to Spark dataframe

mathtick
  • The Databricks spark-csv package has been merged into Spark 2.1.x ... is this different from the new `read.csv`? I guess it also does not solve the multiple date format issue. – mathtick Mar 16 '17 at 13:00
  • Do you have different date formats in the same file? – Alex Mar 16 '17 at 13:14
  • Using PySpark 2.1, this is the best you can do: `df = spark.read.csv(path='data.csv', header='true', inferSchema='true', dateFormat="yyyy-MM-dd", timestampFormat="yyyy-MM-dd'T'HH:mm:ss.SSSZZ")` – Alex Mar 16 '17 at 13:17
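Since `dateFormat` takes a single pattern, one possible workaround for a file with mixed date formats is to normalise the date columns to `yyyy-MM-dd` before Spark reads the file. The following is only a minimal sketch using the Python standard library: the names `CANDIDATE_FORMATS`, `to_iso_date` and `normalise_dates` are hypothetical, and the list of formats is an assumption that has to be adapted to the real data.

```python
import csv
import io
from datetime import datetime

# Assumed set of formats present in the file; adjust to the real data.
# Formats are tried in order and the first one that parses wins.
CANDIDATE_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%Y%m%d"]


def to_iso_date(value, formats=CANDIDATE_FORMATS):
    """Return value rewritten as yyyy-MM-dd, or '' if no format matches."""
    text = value.strip()
    for fmt in formats:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return ""


def normalise_dates(src, dst, date_columns):
    """Copy CSV rows from src to dst, rewriting date_columns as ISO dates."""
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames,
                            lineterminator="\n")
    writer.writeheader()
    for row in reader:
        for col in date_columns:
            row[col] = to_iso_date(row[col])
        writer.writerow(row)


# Small in-memory demonstration; real use would pass open file objects.
src = io.StringIO("id,day\n1,2017-03-16\n2,16/03/2017\n3,20170316\n")
dst = io.StringIO()
normalise_dates(src, dst, ["day"])
print(dst.getvalue())
```

The rewritten file can then be read with the `spark.read.csv(..., dateFormat="yyyy-MM-dd")` call from the comment above. Values that match none of the formats are left empty so they can be inspected separately. Ambiguous inputs such as `03/04/2017` are resolved by whichever format comes first in the list, so the order matters.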

0 Answers