Happy New Year!!!
I know similar questions have been asked and answered before, but my case is different:
I have a large CSV file (100+ fields, 100 MB+) that I want to load into Spark (1.6) for analysis. The CSV's header looks like the attached sample (only one line of the data).
Thank you very much.
UPDATE 1 (2016.12.31, 1:26pm EST):
Using the following approach, I was able to load the data (a sample with a limited number of columns). However, I need the header row of the CSV to be assigned automatically as the column names of the DataFrame. Instead, the DataFrame looks like:
Can anyone tell me how to do this? Note that I want to avoid any manual approach.
>>> import csv
>>> rdd = sc.textFile('file:///root/Downloads/data/flight201601short.csv')
>>> rdd = rdd.mapPartitions(lambda x: csv.reader(x))
>>> rdd.take(5)
>>> df = rdd.toDF()
>>> df.show(5)
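
For reference, here is a minimal sketch of one direction I am considering, not a confirmed solution: read the first line of the file, parse it as the header, drop it from the data, and pass the parsed names to toDF(). The helper names parse_line, header_to_columns and load_csv_with_header are made up for illustration, and the "_c<N>" fallback for blank header cells is my own assumption. It also assumes that no field contains an embedded newline, that no data line is identical to the header line, and that a SQLContext already exists (as in the pyspark shell) so that toDF() is available on the RDD.

```python
import csv


def parse_line(line):
    """Parse one CSV line into a list of fields, respecting quoted commas."""
    return next(csv.reader([line]))


def header_to_columns(header_line):
    """Turn the header line into column names.

    Blank header cells (e.g. from a trailing comma) get a positional
    placeholder name "_c<N>"; this naming is an assumption of the sketch.
    """
    fields = parse_line(header_line)
    return [name.strip() or "_c%d" % i for i, name in enumerate(fields)]


def load_csv_with_header(sc, path):
    """Build a DataFrame whose column names come from the file's first line.

    Hypothetical helper: `sc` is the SparkContext from the pyspark shell.
    All columns come out as strings; nothing is cast here.
    """
    lines = sc.textFile(path)
    header_line = lines.first()            # first line of the file
    columns = header_to_columns(header_line)
    rows = (lines
            .filter(lambda line: line != header_line)  # drop the header row
            .map(parse_line))
    return rows.toDF(columns)              # list of names becomes the schema
```

In the shell this would be called as df = load_csv_with_header(sc, 'file:///root/Downloads/data/flight201601short.csv'). As far as I understand, the external spark-csv package (com.databricks.spark.csv, with its header option) is another route on Spark 1.6, but it has to be added when launching Spark and I have not tried it here.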