Load CSV data into Spark DataFrames using pd.read_csv?

Asked Mar 16 '17 at 08:33

The Spark CSV readers are not as flexible as pandas.read_csv and do not seem able to handle, for example, dates in several different formats. Is there a good way of passing pandas DataFrames to Spark DataFrames in an ETL map step? Spark's createDataFrame does not always appear to work, probably because the mapping between pandas and Spark types is not exhaustive. Paratext looks promising, but it is likely new and not yet heavily used.

For example here: Get CSV to Spark dataframe
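To make the pandas-to-Spark route concrete, here is a minimal sketch. It assumes that reading every column as a string, normalizing the date column in plain Python, and passing rows plus an explicit schema to createDataFrame is acceptable, so Spark never has to infer types. The list of formats, the helper names `parse_date` and `csv_to_spark`, and the "one date column, everything else a string" schema are hypothetical and not taken from any library.

```python
from datetime import datetime

# Hypothetical list of formats expected in the date column (an assumption).
# Note that "%d/%m/%Y" treats 01/02/2017 as 1 February, not 2 January.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%Y%m%d")


def parse_date(text, formats=DATE_FORMATS):
    """Return a datetime.date for the first matching format, else None."""
    for fmt in formats:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None


def csv_to_spark(spark, path, date_col):
    """Read a CSV with pandas on the driver and hand it to Spark.

    `spark` is an existing SparkSession; `date_col` names the column
    holding dates in mixed formats. Both are assumptions of this sketch.
    """
    # Imported here so parse_date stays usable without pandas or pyspark.
    import pandas as pd
    from pyspark.sql.types import DateType, StringType, StructField, StructType

    # Read everything as strings and keep empty cells as "" instead of NaN,
    # so no float NaN values end up in string columns.
    pdf = pd.read_csv(path, dtype=str, keep_default_na=False)
    pdf[date_col] = pdf[date_col].map(parse_date)

    # Explicit schema: the date column is a DateType, the rest are strings.
    schema = StructType([
        StructField(name, DateType() if name == date_col else StringType(), True)
        for name in pdf.columns
    ])

    # Plain tuples of str / date / None avoid relying on type inference.
    rows = list(pdf.itertuples(index=False, name=None))
    return spark.createDataFrame(rows, schema=schema)
```

Since pandas reads the whole file on the driver, this only suits files that fit in driver memory; unparseable dates become nulls rather than raising an error.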

edited May 23 '17 at 12:09

Community

asked Mar 16 '17 at 08:33

mathtick

databricks spark csv has been merged into spark 2.1.x ... is this different from the new read.csv? I guess it also does not solve the multiple date format issues. – mathtick Mar 16 '17 at 13:00
Do you have different date formats in the same file? – Alex Mar 16 '17 at 13:14
Using PySpark 2.1, this is the best you can do: `df = spark.read.csv('data.csv', header='true', inferSchema='true', dateFormat="yyyy-MM-dd", timestampFormat="yyyy-MM-dd'T'HH:mm:ss.SSSZZ")` – Alex Mar 16 '17 at 13:17

0 Answers