Here's my Spark code in Python, which I execute with Hadoop running in the background:
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import *
if __name__ == "__main__":
    sc = SparkContext(appName="CSV2Parquet")
    sqlContext = SQLContext(sc)

    # Explicit schema for the CSV columns
    schema = StructType([
        StructField("ID", IntegerType(), True),
        StructField("name", StringType(), True),
        StructField("insert_timestamp_utc", TimestampType(), True),
        StructField("value", DoubleType(), True)])

    #rdd = sc.textFile("parquet-test-2.csv").map(lambda line: line.split(","))
    #df = sqlContext.createDataFrame(rdd, schema)
    df = sqlContext.read.csv("parquet-test-2.csv", header=True, sep=",", schema=schema)

    df.show()
    df.write.parquet('output-parquet')
df.show() works properly with my schema and displays the data correctly, converting the empty values to null. However, when the code reaches the write call, I get errors. I suspect it's due to the null values, but I haven't been able to resolve it.
Can anyone help me with this?
Here's a link to the error text in question: https://shrib.com/#T.GjdcJbgl9tfEYAsxsV
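In case it's useful, here's a workaround I was considering but haven't verified: filling or dropping the nulls before writing. I'm not sure the nulls are really the cause, so this is just a sketch of what I'd try (column names match my schema above; the output path is just for illustration):

    # Sketch of an unverified workaround: handle nulls before writing to Parquet.
    # Fill nulls in the int/string/double columns with defaults...
    df_clean = df.na.fill({"ID": 0, "name": "", "value": 0.0})
    # ...and drop rows with a null timestamp, since na.fill() doesn't cover TimestampType.
    df_clean = df_clean.na.drop(subset=["insert_timestamp_utc"])
    df_clean.write.parquet('output-parquet-clean')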
I'm new to Stack Overflow as a user (I usually find my answers by lurking). If there's any additional info you need to help me with this, please let me know and I'll add it.