
I used Spark on EMR to copy tables from Oracle to S3 in Parquet format, then ran a Glue crawler over the S3 data and registered the tables in Athena. The data ingestion works fine, but when I try to preview the data I get this error:

GENERIC_INTERNAL_ERROR: integer overflow

I have tried the pipeline multiple times. The original schema is this:

SAMPLEINDEX NUMBER(38, 0)
GENEINDEX NUMBER(38, 0)
VALUE FLOAT
MINSEGMENTLENGTH NUMBER(38, 0)

I tried casting the data to integer, long, and string, but the error persists. I also inspected the original dataset and didn't find any value that could cause an int overflow.
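For reference, this is roughly how such a check could be done in Spark (a minimal sketch using the `table` DataFrame from the sample code below; column names come from the schema above). It shows the maximum value per numeric column, which can be compared against `Int.MaxValue` (2147483647):

    import org.apache.spark.sql.functions.max

    // Largest value per numeric column, to rule out values that would overflow a 32-bit int.
    table
      .agg(
        max("SAMPLEINDEX").as("max_sampleindex"),
        max("GENEINDEX").as("max_geneindex"),
        max("MINSEGMENTLENGTH").as("max_minsegmentlength"))
      .show()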

Tables with fewer than 800 million rows work perfectly fine, but once a table has more than 800 million rows the error starts to appear.

Here is some sample code in Scala:

    // sparkSession is an existing SparkSession; connection details are redacted.
    val table = sparkSession.read
      .format("jdbc")
      .option("url", "jdbc:oracle:thin://@XXX")
      .option("dbtable", "tcga.%s".format(tableName))
      .option("user", "XXX")
      .option("password", "XXX")
      .option("driver", "oracle.jdbc.driver.OracleDriver")
      .option("fetchsize", "50000")
      .option("numPartitions", "200")
      .load()

    println("writing tablename: %s".format(tableName))

    // Cast every column to string before writing.
    val finalDF = table.selectExpr(
        "cast(SAMPLEINDEX as string) as SAMPLEINDEX",
        "cast(GENEINDEX as string) as GENEINDEX",
        "cast(VALUE as string) as VALUE",
        "cast(MINSEGMENTLENGTH as string) as MINSEGMENTLENGTH")
      .repartition(200) // repartition returns a new DataFrame, so keep the result

    finalDF.printSchema()
    finalDF.write.format("parquet").mode("Overwrite").save("s3n://XXX/CNVGENELEVELDATATEST")
    finalDF.show()

Does anyone know what may cause the issue?

logydoghan
  • Lack of distribution on input - see [Partitioning in spark while reading from RDBMS via JDBC](https://stackoverflow.com/q/43150694/10465355), [How to optimize partitioning when migrating data from JDBC source?](https://stackoverflow.com/q/52603131/10465355), [Parallelizing Reads](https://github.com/awesome-spark/spark-gotchas/blob/master/05_spark_sql_and_dataset_api.md#parallelizing-reads) – 10465355 Mar 23 '19 at 19:59
  • Thanks! It solved the issue. I added partitionBy on a particular column and the error disappeared. However, there are now some missing rows. Should I partitionBy all of the columns? – logydoghan Mar 24 '19 at 04:27
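For reference, a minimal sketch of the partitioned JDBC read suggested in the comment above, assuming SAMPLEINDEX is a numeric, roughly evenly distributed column (the column choice and the bound values are assumptions, not taken from the original post):

    // Spark splits the read into numPartitions range queries on partitionColumn
    // between lowerBound and upperBound; rows outside the bounds still land in
    // the first or last partition, so the bounds control parallelism, not filtering.
    val table = sparkSession.read
      .format("jdbc")
      .option("url", "jdbc:oracle:thin://@XXX")
      .option("dbtable", "tcga.%s".format(tableName))
      .option("user", "XXX")
      .option("password", "XXX")
      .option("driver", "oracle.jdbc.driver.OracleDriver")
      .option("fetchsize", "50000")
      .option("partitionColumn", "SAMPLEINDEX") // assumed partition key
      .option("lowerBound", "1")                // assumed MIN(SAMPLEINDEX)
      .option("upperBound", "20000")            // assumed MAX(SAMPLEINDEX)
      .option("numPartitions", "200")
      .load()

Note that Spark's JDBC source accepts only a single partitionColumn per read, so partitioning by all columns is not an option. Since the bounds do not filter rows, any missing rows likely come from elsewhere in the pipeline; comparing the source row count against the row count of the written Parquet would help narrow that down.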

0 Answers