Import CSV to pyspark dataframe

Question

I am new to PySpark and am trying to load a CSV file that looks like this:

   article_id   title                                  short_desc                                           
    33          novel findings support original        asco-cap guidelines support categorization of her2 by fish status used in bcirg clinical trials

My code to read the CSV:

from pyspark.sql import SparkSession

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark = SparkSession.builder.appName('Basics').getOrCreate()
schema = StructType([
    StructField("article_id", IntegerType()),
    StructField("title", StringType()),
    StructField("short_desc", StringType()),
    StructField("article_desc", StringType())
])

peopleDF = spark.read.csv('temp.csv', header=True, schema=schema)

peopleDF.show(6)

Why are null values being added to the output?

Here is a sample of the dataset so that you can reproduce the problem:

DataSet Sample

[Don't post pictures of data](https://meta.stackoverflow.com/a/285557/5858851) and don't post links to the data. Please try to provide an [mcve]. — pault, Apr 24 '18 at 15:22
Edited as you requested. Each record in the dataset is huge, so I am giving a sample of it. It won't be removed, I promise. — Sriram Arvind Lakshmanakumar, Apr 24 '18 at 15:35
That sample line is likely insufficient to reproduce your error. Could you include a few more rows? Read more on [how to create good reproducible apache spark dataframe examples](https://stackoverflow.com/questions/48427185/how-to-make-good-reproducible-apache-spark-dataframe-examples). — pault, Apr 24 '18 at 15:40
Thank you for your advice, pault. I will follow it next time. My error is fixed and a two-row sample dataset is provided. As promised, I won't remove the sample dataset from the online resource. — Sriram Arvind Lakshmanakumar, Apr 25 '18 at 16:50
You must also understand that even the 3-column dataset was very lengthy, and I couldn't have pasted it into the Stack Overflow window. Sometimes normal users have reasons for sharing a CSV dataset online. — Sriram Arvind Lakshmanakumar, Apr 25 '18 at 16:54
Sure, but you didn't have to post your exact data. You could have made up an example that recreates your issue. — pault, Apr 25 '18 at 16:55

score 0 · Accepted Answer · answered Apr 24 '18 at 18:27

The Excel sheet you are trying to read contains merged cells.

Spark does not read them as merged cells; it splits them into separate lines. In your case, the 'article_desc' column spans 5 such cells vertically, and in those rows the cells of the other columns are empty. Hence the null values.

If you put all of the content into a single cell, you will be able to read the file without the null values.
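As a rough illustration of "getting all the content into a single cell", here is a minimal sketch that folds continuation rows back into the record they belong to before the file is handed to Spark. It uses only Python's standard `csv` module. The sample text, the `merge_continuation_rows` helper, and the assumption that a continuation row is one with an empty `article_id` are all made up for this example and are not taken from the original dataset.

```python
import csv
import io

# Hypothetical sample: the long description was split across several rows,
# so the continuation rows have empty leading columns (these show up as null).
raw = """article_id,title,short_desc,article_desc
33,novel findings,asco-cap guidelines,first part of the text
,,,second part of the text
,,,third part of the text
34,another title,another summary,single-line text
"""


def merge_continuation_rows(text):
    """Fold rows with an empty article_id into the previous record."""
    reader = csv.DictReader(io.StringIO(text))
    merged = []
    for row in reader:
        if row["article_id"]:
            # A row with an id starts a new record.
            merged.append(dict(row))
        elif merged:
            # Assumed continuation row: append its text to the previous record.
            merged[-1]["article_desc"] += " " + row["article_desc"]
    return merged


records = merge_continuation_rows(raw)
for record in records:
    print(record)
```

The merged records could then be written back out with `csv.DictWriter` and read with `spark.read.csv` as in the question. If the line breaks in the real file sit inside quoted fields rather than in separate rows, passing `multiLine=True` to `spark.read.csv` may be enough on its own; whether that applies depends on how the file was exported.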
