I have a dataframe with 4 million rows and 10 columns. I am trying to write this to a table in HDFS from the Cloudera Data Science Workbench using PySpark, and I am running into the following error:
[Stage 0:> (0 + 1) / 2]
19/02/20 12:31:04 ERROR datasources.FileFormatWriter: Aborting job null.
org.apache.spark.SparkException: Job aborted due to stage failure: Serialized task 0:0 was 318690577 bytes, which exceeds max allowed: spark.rpc.message.maxSize (134217728 bytes). Consider increasing spark.rpc.message.maxSize or using broadcast variables for large values.
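For reference, I believe this is the configuration setting the error message refers to; passing it when the session is built would look something like the snippet below (the 512 MB value is just a guess on my part, and the setting has to be applied before the SparkContext is created):

from pyspark.sql import SparkSession

# Hypothetical: raise the RPC message size limit (value is in MB, default 128)
spark = SparkSession.builder \
    .appName('Arin_Network') \
    .config('spark.rpc.message.maxSize', '512') \
    .getOrCreate()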
I can break the dataframe into 3 dataframes and perform the Spark write 3 separate times, but I would like to do this in a single write if possible, perhaps by adding something like coalesce to the Spark code (a rough sketch of what I mean by that is below, after my current code).
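The 3-way split workaround I have in mind would look roughly like this (the chunk size math is just an illustration of writing the rows in slices, not something I have tested):

# Hypothetical workaround: write the pandas dataframe in 3 row-wise slices
n = len(df)
step = (n + 2) // 3  # ceiling division so all rows are covered
for start in range(0, n, step):
    chunk = df.iloc[start:start + step]
    spark.createDataFrame(chunk, schema) \
        .write.mode("append") \
        .option("path", "/user/hive/warehouse/bulkwhois_analytics.db/arin_network") \
        .saveAsTable("bulkwhois_analytics.arin_network")

My current single-write code is: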
import pandas as pd

# Read the parsed ARIN bulk WHOIS data into a pandas dataframe
df = pd.read_csv('BulkWhois/2019-02-20_Arin_Bulk/Networks_arin_db_2-20-2019_parsed.csv')

# PySpark
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName('Arin_Network').getOrCreate()

schema = StructType([StructField('NetHandle', StringType(), False),
                     StructField('OrgID', StringType(), True),
                     StructField('Parent', StringType(), True),
                     StructField('NetName', StringType(), True),
                     StructField('NetRange', StringType(), True),
                     StructField('NetType', StringType(), True),
                     StructField('Comment', StringType(), True),
                     StructField('RegDate', StringType(), True),
                     StructField('Updated', StringType(), True),
                     StructField('Source', StringType(), True)])

# Convert the pandas dataframe to a Spark dataframe and append it to the Hive table
dataframe = spark.createDataFrame(df, schema)
dataframe.write \
    .mode("append") \
    .option("path", "/user/hive/warehouse/bulkwhois_analytics.db/arin_network") \
    .saveAsTable("bulkwhois_analytics.arin_network")