Happy New Year!!!

I know similar questions have been asked and answered before, but mine is different:

I have a large CSV file (100+ fields, 100 MB+) that I want to load into Spark (1.6) for analysis. The CSV's header looks like the attached sample (only one line of the data).

Thank you very much.

UPDATE 1 (2016-12-31, 1:26pm EST):

I used the following approach and was able to load the data (sample data with limited columns). However, I need the header from the CSV to be assigned automatically as the field names in the DataFrame, but the DataFrame looks like this:

[screenshot of the DataFrame]

Can anyone tell me how to do this? Note that I want to avoid any manual approach.

>>> import csv
>>> rdd = sc.textFile('file:///root/Downloads/data/flight201601short.csv') 
>>> rdd = rdd.mapPartitions(lambda x: csv.reader(x))
>>> rdd.take(5) 
>>> df = rdd.toDF() 
>>> df.show(5) 
Yaron
PasLeChoix
  • What problem do you experience loading this dataset using standard methods? (for example: http://stackoverflow.com/a/34528938/7098262) – Mariusz Dec 31 '16 at 17:14
  • Thanks. The problem is the 100+ fields: explicitly adding all the fields is a tedious job, and I believe there should be a mature way to handle it – PasLeChoix Dec 31 '16 at 18:15
  • Take a look at the exact answer I referenced above: if you use the spark-csv package to read the file, there is a `header` option that will solve your problem easily. – Mariusz Jan 01 '17 at 15:05
  • Thanks. `pyspark --packages com.databricks:spark-csv_2.10:1.4.0` resolved the issue in Spark 1.6 – PasLeChoix Jan 02 '17 at 13:53

1 Answer

As noted in the comments, the CSV reader's header option solves this. In Spark 2.0.0+ the reader is built in as spark.read.csv (https://spark.apache.org/docs/2.0.0/api/python/pyspark.sql.html):

df = spark.read.csv('your_file.csv', header=True, inferSchema=True)

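Setting header to True uses the first line of the file as the DataFrame's column names. Setting inferSchema to True infers each column's type from the data, but it slows down reading because Spark has to make an extra pass over the file; without it, every column is read as a string.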
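For Spark 1.6, as in the question, the same two options are available through the spark-csv package mentioned in the comments. A minimal sketch, assuming the shell was started with `pyspark --packages com.databricks:spark-csv_2.10:1.4.0` so that `sqlContext` exists; the helper name `load_csv_with_header` is only for illustration and is not part of Spark:

```python
# Data source name registered by the spark-csv package.
SPARK_CSV_FORMAT = 'com.databricks.spark.csv'


def load_csv_with_header(sql_context, path):
    """Read a CSV into a DataFrame, taking column names from the first line.

    `sql_context` is assumed to be the SQLContext that the pyspark shell
    provides as `sqlContext`; this helper is illustrative, not part of Spark.
    """
    return (sql_context.read
            .format(SPARK_CSV_FORMAT)
            .options(header='true', inferSchema='true')
            .load(path))


# In the pyspark shell:
# df = load_csv_with_header(sqlContext, 'file:///root/Downloads/data/flight201601short.csv')
# df.printSchema()
```

With 100+ fields, `df.printSchema()` is a quick way to check that the names and inferred types came through as expected.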

See also here: Load CSV file with Spark
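If adding a package is not an option, the RDD approach from the question can also pick up the header itself. A rough sketch, assuming the first line of the file is the header and that no data line is identical to it; `parse_header` and `rdd_to_df_with_header` are hypothetical helper names, and every column comes out as a string:

```python
import csv


def parse_header(header_line):
    """Split the raw header line into column names using the csv module."""
    return next(csv.reader([header_line]))


def rdd_to_df_with_header(lines_rdd):
    """Turn an RDD of raw CSV lines into a DataFrame named after the header.

    Assumes an active SQLContext (as in the pyspark shell), which is what
    makes `toDF` available on RDDs.
    """
    header_line = lines_rdd.first()
    columns = parse_header(header_line)
    rows = (lines_rdd
            .filter(lambda line: line != header_line)  # drop the header row
            .mapPartitions(csv.reader))                # parse the remaining lines
    return rows.toDF(columns)


# In the pyspark shell:
# df = rdd_to_df_with_header(sc.textFile('file:///root/Downloads/data/flight201601short.csv'))
# df.show(5)
```

Because `sc.textFile` splits the input on newlines, this sketch would not handle quoted fields that contain line breaks; the CSV readers above are the safer choice for such files.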

Community
O. Gindele