I am trying to use Spark SQL from Spark 2 in a Cloudera environment and I am getting the following error:
pyspark.sql.utils.AnalysisException: u'Cannot up cast other_column_from_table
from decimal(32,22) to decimal(30,22) as it may truncate\n;'
We do not use the column other_column_from_table that Spark SQL tries to cast anywhere in the SELECT statement, yet it is the cause of the error. Below is the code:
spark2-submit --deploy-mode cluster --driver-cores 2 --driver-memory 4G --executor-cores 2 --executor-memory 6G --conf "spark.sql.parquet.writeLegacyFormat=true" /home/adonnert/teste_alexandre.py
import sys
import time
import traceback
from datetime import datetime, timedelta, date

from pyspark.sql import SparkSession
from pyspark.sql.functions import coalesce, from_unixtime
from pyspark.sql.types import (ArrayType, IntegerType, MapType, StringType,
                               StructField, StructType)
spark = SparkSession.builder.appName("PySparkSQL_VRJ_EC_GDC_ALE") \
    .enableHiveSupport() \
    .config("hive.exec.dynamic.partition", "true") \
    .config("hive.exec.dynamic.partition.mode", "nonstrict") \
    .config("spark.debug.maxToStringFields", "200") \
    .config("spark.sql.shuffle.partitions", "200") \
    .config("spark.sql.inMemoryColumnarStorage.compressed", "true") \
    .config("spark.sql.inMemoryColumnarStorage.batchSize", "10000") \
    .getOrCreate()
# keep the DataFrame; .show() returns None, so it must be called separately
df_lead = spark.sql("""
    SELECT
        my_id,
        value_number
    FROM owner.table
    WHERE date >= CAST(DATE_FORMAT(ADD_MONTHS(current_timestamp(), -13), 'yyyyMM') AS BIGINT)
""")
df_lead.show(10)
Is there a way to deal with this, i.e., to stop Spark SQL from casting a column that is not even referenced in the query? The query does not even produce a DataFrame, so I cannot work around it by applying a schema.
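For completeness, another unverified idea I am considering is to bypass Spark's native Parquet reader, so that the reconciliation between the metastore schema (decimal(30,22)) and the file schema (decimal(32,22)), which seems to be what triggers the cast, never happens:

# Unverified sketch: with spark.sql.hive.convertMetastoreParquet=false, Spark
# reads Hive Parquet tables through the Hive SerDe instead of its native
# Parquet reader, at the cost of performance.
spark.conf.set("spark.sql.hive.convertMetastoreParquet", "false")
df_lead = spark.sql("SELECT my_id, value_number FROM owner.table")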