
I am trying to read a big table (~100M rows) into Spark. The table is in PostgreSQL and we are reading it as follows:

import java.util.Properties

val connectionProperties = new Properties()
connectionProperties.put("user", "$USER")
connectionProperties.put("password", "$PASSWORD")

// val connection = DriverManager.getConnection("$SERVER", "$USER", "$PASSWORD")
// connection.isClosed()

val jdbc_url = s"jdbc:postgresql://$HOST:$PORT/$DB"
val df = spark.read.option("inferSchema", true).jdbc(jdbc_url, "$TABLE", connectionProperties)

But our SQL table has two columns of the money datatype (formatted like $100,000.23). When reading in Spark, they get converted to double and an exception is thrown.

We have tried: a) casting the column data to Double, but that does not help because Spark already casts to double automatically and fails on the commas in the values; b) since the data arrives in the DataFrame with commas, using the PostgreSQL dialect (https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/jdbc/PostgresDialect.scala).

Any help will be really appreciated.

Pranab
  • Did you try to invoke a SELECT query, with conversion of that lame Money datatype into a more portable datatype *before* Spark receives the data ? E.g. `(select A, B, cast(C as decimal(16,2)) as CC from TBL) x` – Samson Scharfrichter Aug 10 '17 at 18:44
  • Also `decimal(,)` avoids the rounding errors that are inevitable with floating-point types, and that play hell with financial use cases... – Samson Scharfrichter Aug 10 '17 at 18:48
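A minimal sketch of the subquery approach suggested in the comments above, reusing the jdbc_url and connectionProperties from the question; the table and column names are hypothetical placeholders, not the asker's actual schema:

    // Cast the money columns to a portable numeric type inside a subquery,
    // so Spark never sees values formatted like "$100,000.23".
    // Spark's JDBC reader accepts a parenthesised subquery in place of a table name.
    val query = "(SELECT id, amount::numeric(16,2) AS amount FROM my_table) AS t"
    val moneyFreeDf = spark.read.jdbc(jdbc_url, query, connectionProperties)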

1 Answer


You can try to specify the schema manually, read the column as a string, and then parse the value yourself by defining a User Defined Function.

To specify the schema manually you need to write something like this:

    import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

    val schema = StructType(
      StructField("your-example-column1", IntegerType, true) ::
      StructField("your-money-column", StringType, true) :: Nil)

    spark.read.schema(schema)

See the Spark Scala API:

To learn more about how to convert a StringType to the data type you need, refer to this question.
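For the money column specifically, a minimal sketch of that conversion, assuming it has already been read as a string like "$100,000.23" into a DataFrame df; the column name matches the placeholder in the schema above and parseMoney is a made-up name:

    import org.apache.spark.sql.functions.{col, udf}

    // Strip the currency symbol and thousands separators, then convert to Double.
    // Returning an Option keeps SQL NULLs as nulls in the result.
    val parseMoney = udf { (s: String) =>
      Option(s).map(_.replaceAll("[$,]", "").toDouble)
    }

    val parsed = df.withColumn("your-money-column", parseMoney(col("your-money-column")))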

addmeaning
    It is not considered good practice to inferSchema in a production environment. You should always specify the schema manually. – Michel Lemay Aug 10 '17 at 19:11
  • As mentioned in the pseudocode, we have tried withColumn and selectExpr to change the datatype after the data is in the DataFrame. That did not work because double throws an exception for values with commas. – Pranab Aug 11 '17 at 06:17
  • It shouldn't work because the dataframe can't infer your type. Instead, you should specify the schema _before_ you read the data into the dataframe. – addmeaning Aug 11 '17 at 07:01