How to load a large CSV with many fields into Spark

Question

Happy New Year!!!

I know similar questions have been asked and answered before, but mine is different:

I have a large CSV file (100+ fields, 100 MB+) that I want to load into Spark (1.6) for analysis. The CSV's header looks like the attached sample (only one line of the data).

Thank you very much.

UPDATE 1 (2016.12.31, 1:26pm EST):

I used the following approach and was able to load the data (sample data with limited columns). However, I need the header from the CSV to be assigned automatically as the field names in the DataFrame, but the DataFrame looks like:

Can anyone tell me how to do it? Note that I want to avoid any manual approach.

>>> import csv
>>> rdd = sc.textFile('file:///root/Downloads/data/flight201601short.csv') 
>>> rdd = rdd.mapPartitions(lambda x: csv.reader(x))
>>> rdd.take(5) 
>>> df = rdd.toDF() 
>>> df.show(5)
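
One way to keep the RDD approach above and still use the CSV header as column names is sketched below. It assumes the first line of the file is the header and that no field contains an embedded newline. `parse_lines` and `rdd_to_df_with_header` are hypothetical helper names, and `sc` is assumed to be the SparkContext provided by the pyspark shell. Every column comes out as a string.

```python
import csv


def parse_lines(lines):
    """Parse an iterator of CSV text lines into lists of fields."""
    return csv.reader(lines)


def rdd_to_df_with_header(sc, path):
    """Load a CSV through an RDD and name the columns after its header row.

    Assumes the first line of the file is the header (hypothetical helper).
    """
    rows = sc.textFile(path).mapPartitions(parse_lines)
    header = rows.first()  # list of column names taken from the first line
    # Drop the header row; note this also drops any data row identical to it.
    data = rows.filter(lambda row: row != header)
    return data.toDF(header)  # toDF accepts a list of column names
```

In the shell this would be called as `df = rdd_to_df_with_header(sc, 'file:///root/Downloads/data/flight201601short.csv')`, followed by `df.show(5)`.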

What problem do you experience loading this dataset using standard methods? (for example: http://stackoverflow.com/a/34528938/7098262) — Mariusz, Dec 31 '16 at 17:14
Thanks. The problem is the 100+ fields: explicitly adding all of them is tedious, and I believe there should be a mature way to handle it — PasLeChoix, Dec 31 '16 at 18:15
Take a look at the exact answer I referenced above: if you use the spark-csv package to read the file, there is a `header` option that will solve your problem easily. — Mariusz, Jan 01 '17 at 15:05
Thanks. `pyspark --packages com.databricks:spark-csv_2.10:1.4.0` resolved the issue in Spark 1.6 — PasLeChoix, Jan 02 '17 at 13:53
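
For reference, the spark-csv route from these comments might look like the sketch below. It assumes pyspark was started with the `--packages` flag shown above, so the `com.databricks.spark.csv` data source is on the classpath, and that `sqlContext` is the SQLContext provided by the shell. `read_csv_with_header` is a hypothetical helper name.

```python
def read_csv_with_header(sqlContext, path):
    """Read a CSV with the spark-csv package, taking column names from the header.

    Hypothetical helper; requires the spark-csv package on the classpath.
    """
    return (sqlContext.read
            .format('com.databricks.spark.csv')
            # header='true' uses the first line as column names;
            # inferSchema='true' asks spark-csv to guess each column's type.
            .options(header='true', inferSchema='true')
            .load(path))
```

With this, `df = read_csv_with_header(sqlContext, 'file:///root/Downloads/data/flight201601short.csv')` should give a DataFrame whose columns are named after the header, with no need to list the 100+ fields by hand.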

score 5 · Answer 1 · edited May 23 '17 at 12:01

As noted in the comments, you can use spark.read.csv in Spark 2.0.0+ (https://spark.apache.org/docs/2.0.0/api/python/pyspark.sql.html):

df = spark.read.csv('your_file.csv', header=True, inferSchema=True)

Setting header to True uses the first line of the file as the DataFrame's column names. Setting inferSchema to True infers each column's type from the data, which slows down reading because it requires an extra pass over the file.
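
If the inference pass is too slow for a large file, one possible alternative is to read with header=True only, which leaves every column as a string, and cast just the columns that need a type. The sketch below assumes a Spark 2.0.0+ `spark` session. `read_csv_cast_columns`, the `casts` mapping, and the column names in the usage note are hypothetical.

```python
def read_csv_cast_columns(spark, path, casts):
    """Read a CSV using header names only, then cast selected columns.

    casts maps a column name to a Spark SQL type name such as 'int' or 'double'
    (hypothetical helper; columns not listed stay as strings).
    """
    # No inferSchema here, so Spark does not scan the data to guess types.
    df = spark.read.csv(path, header=True)
    for name, dtype in casts.items():
        df = df.withColumn(name, df[name].cast(dtype))
    return df
```

For example, `read_csv_cast_columns(spark, 'your_file.csv', {'YEAR': 'int', 'DISTANCE': 'double'})` would cast two columns and leave the rest untouched. `YEAR` and `DISTANCE` are made-up names, not taken from the asker's file.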
