I used Spark on EMR to copy tables from Oracle to S3 in Parquet format, then ran a Glue crawler over the S3 data to register the tables in Athena. The ingestion itself works fine, but when I try to preview the data in Athena I get this error:
GENERIC_INTERNAL_ERROR: integer overflow
I have run the pipeline multiple times with the same result. The original Oracle schema is:
SAMPLEINDEX (NUMBER(38, 0))
GENEINDEX (NUMBER(38, 0))
VALUE (FLOAT)
MINSEGMENTLENGTH (NUMBER(38, 0))
I have tried casting the columns to integer, long, and string, but the error persists. I also inspected the original dataset and didn't find any value that could cause an integer overflow.
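The inspection was essentially a min/max aggregation over the numeric columns, roughly like the sketch below (it reuses the `table` DataFrame loaded via JDBC in the code further down; this is the kind of check I ran, not the exact code):

import org.apache.spark.sql.functions.{min, max}

// Range check on the numeric columns; `table` is the JDBC DataFrame from the
// sample code below.
table.agg(
  min("SAMPLEINDEX"), max("SAMPLEINDEX"),
  min("GENEINDEX"), max("GENEINDEX"),
  min("MINSEGMENTLENGTH"), max("MINSEGMENTLENGTH")
).show()
// None of the observed values exceeded Int.MaxValue (2147483647).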
Tables with fewer than 800 million rows work perfectly fine, but once a table has more than 800 million rows the error starts to appear.
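Roughly, the row-count check on the parquet side looks like this (a sketch, not the exact code; the S3 path is the same placeholder used in the code below):

// Read the written parquet back from S3 and spot-check it.
val written = sparkSession.read.parquet("s3n://XXX/CNVGENELEVELDATATEST")
println("rows written: %d".format(written.count()))  // > 800 million for the failing tables
written.show(5)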
Here is some sample code in Scala:
// Connection details are redacted; tableName is supplied by the caller.
val table = sparkSession.read
  .format("jdbc")
  .option("url", "jdbc:oracle:thin://@XXX")
  .option("dbtable", "tcga.%s".format(tableName))
  .option("user", "XXX")
  .option("password", "XXX")
  .option("driver", "oracle.jdbc.driver.OracleDriver")
  .option("fetchsize", "50000")
  .option("numPartitions", "200")
  .load()

println("writing tablename: %s".format(tableName))

// Cast every column to string (I also tried integer and long), then
// repartition before writing to S3.
val finalDF = table
  .selectExpr(
    "cast(SAMPLEINDEX as string) as SAMPLEINDEX",
    "cast(GENEINDEX as string) as GENEINDEX",
    "cast(VALUE as string) as VALUE",
    "cast(MINSEGMENTLENGTH as string) as MINSEGMENTLENGTH")
  .repartition(200)

finalDF.printSchema()
finalDF.write.format("parquet").mode("Overwrite").save("s3n://XXX/CNVGELEVELDATATEST")
finalDF.show()
Does anyone know what may cause the issue?