
So here's my Spark code in Python, which I run with Hadoop running in the background:

    from pyspark import SparkContext
    from pyspark.sql import SQLContext
    from pyspark.sql.types import *


    if __name__ == "__main__":
        sc = SparkContext(appName="CSV2Parquet")
        sqlContext = SQLContext(sc)

        schema = StructType([
                StructField("ID", IntegerType(), True),
                StructField("name", StringType(), True),
                StructField("insert_timestamp_utc", TimestampType(), True),
                StructField("value", DoubleType(), True)])

        #rdd = sc.textFile("parquet-test-2.csv").map(lambda line: line.split(","))
        #df = sqlContext.createDataFrame(rdd, schema)
        df = sqlContext.read.csv("parquet-test-2.csv", header=True, sep=",", schema=schema)
        df.show()
        df.write.parquet('output-parquet')

The show function works properly with my schema and displays the data correctly, converting the empty values to null. However, when the code reaches the write function I get errors. I'm guessing it's due to the null values, but I haven't been able to deal with it.
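
For reference, this is roughly how I'd expect to work around the nulls if they were the cause (a sketch only; the fill values below are placeholders, not my real data):

    # Sketch: fill or drop nulls before writing, in case they are the problem.
    # fillna takes per-column defaults; timestamp columns can't be filled this way.
    df_filled = df.fillna({"ID": 0, "name": "", "value": 0.0})
    # Or simply drop incomplete rows instead:
    df_dropped = df.na.drop()
    df_filled.write.parquet('output-parquet')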

Can you guys help me with that?

Here's a link to the error text in question: https://shrib.com/#T.GjdcJbgl9tfEYAsxsV

I'm new to StackOverflow as a user (I usually find my answers by lurking in the forums). If there's any additional info you need to help me with this, please let me know and I'll add it.

  • Have you looked at [this question](https://stackoverflow.com/questions/40764807/null-entry-in-command-string-exception-in-saveastextfile-on-pyspark)? – mtoto Jan 27 '19 at 12:00
  • @mtoto I tried it and now I get the following error: https://shrib.com/#l4lCZ24c8ZOzEMY0qQbV – Pedro González Jan 27 '19 at 13:54
  • I was able to nail it: after pasting the exe into my Hadoop bin folder and setting the environment variables correctly, the code ran smoothly (a sketch of that setup is below). – Pedro González Jan 27 '19 at 15:11
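
For anyone hitting the same thing: assuming the exe in question is winutils.exe (as in the linked question), the setup amounts to something like this, with placeholder paths:

    import os

    # Point HADOOP_HOME at the folder whose bin\ contains winutils.exe.
    # C:\hadoop is a placeholder for wherever you unpacked it.
    os.environ["HADOOP_HOME"] = r"C:\hadoop"
    os.environ["PATH"] += os.pathsep + r"C:\hadoop\bin"

    # These must be set before the SparkContext is created,
    # e.g. at the very top of the script above.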

0 Answers