I am new to PySpark and am trying to load a CSV file that looks like this:
article_id title short_desc
33 novel findings support original asco-cap guidelines support categorization of her2 by fish status used in bcirg clinical trials
My code to read the CSV:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField
from pyspark.sql.types import DoubleType, IntegerType, StringType
spark = SparkSession.builder.appName('Basics').getOrCreate()
schema = StructType([
    StructField("article_id", IntegerType()),
    StructField("title", StringType()),
    StructField("short_desc", StringType()),
    StructField("article_desc", StringType())
])
peopleDF = spark.read.csv('temp.csv', header=True, schema=schema)
peopleDF.show(6)
Why is null being added?
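One possible cause is that the columns are separated by tabs rather than commas. This is only an assumption, because the file's actual delimiter is not visible here. `spark.read.csv` splits on commas by default, so a line with no commas would arrive as a single field and could not fill the four fields declared in the schema. The sketch below does not use Spark. It relies only on Python's standard `csv` module to count the fields per row under each delimiter. The names `sample`, `SCHEMA_FIELDS`, and `field_counts` are hypothetical, and the point where the title ends in the sample row is a guess.

```python
import csv
import io

# Hypothetical sample: the header and first row shown above, ASSUMING the
# columns are tab-separated. Where "title" ends and "short_desc" begins is
# a guess made for illustration only.
sample = (
    "article_id\ttitle\tshort_desc\n"
    "33\tnovel findings support original asco-cap guidelines\t"
    "support categorization of her2 by fish status used in bcirg clinical trials\n"
)

# The schema in the question declares four fields.
SCHEMA_FIELDS = 4


def field_counts(text, delimiter):
    """Return the number of fields found in each row when split on `delimiter`."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    return [len(row) for row in reader]


comma_counts = field_counts(sample, ",")
tab_counts = field_counts(sample, "\t")

print("fields per row with ',':", comma_counts)
print("fields per row with tab:", tab_counts)
print("fields expected by schema:", SCHEMA_FIELDS)
```

If the tab counts look right on the real file, passing `sep='\t'` to `spark.read.csv` may be worth trying. Separately, the schema lists `article_desc`, which the header shown above does not contain, so that column would likely have no data to read either way.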
Dataset sample so that you can reproduce the problem: