I am trying to load my PostgreSQL tables into Spark. I have successfully read a table from PostgreSQL into Spark using JDBC. I have code written in R that I want to run on the table, but I cannot access the data from R.

I am using the following code to connect:

    val pgDF_table = spark.read
      .format("jdbc")
      .option("driver", "org.postgresql.Driver")
      .option("url", "jdbc:postgresql://10.128.0.4:5432/sparkDB")
      .option("dbtable", "survey_results")
      .option("user", "prashant")
      .option("password", "pandey")
      .load()

    pgDF_table.show()

Is there any option like spark.write for writing the data back?

Vasudha Jain
  • You may find https://spark.rstudio.com helpful. It allows you to define Spark jobs directly from R, rather than Scala, and bring data from Spark into R for further processing. – TroyHurts Jul 29 '19 at 10:54
  • But my data is in PostgreSQL; there's nothing related to that there. – Vasudha Jain Jul 29 '19 at 11:11

1 Answer


In SparkR, you can read data from JDBC using the following; a short usage sketch follows the argument list:

read.jdbc(url, tableName, partitionColumn = NULL, lowerBound = NULL,
  upperBound = NULL, numPartitions = 0L, predicates = list(), ...)

Arguments

`url`: JDBC database URL of the form 'jdbc:subprotocol:subname'.

`tableName`: the name of the table in the external database.

`partitionColumn`: the name of a column of integral type that will be used for partitioning.

`lowerBound`: the minimum value of `partitionColumn` used to decide partition stride.

`upperBound`: the maximum value of `partitionColumn` used to decide partition stride.

`numPartitions`: the number of partitions. This, along with `lowerBound` (inclusive) and `upperBound` (exclusive), forms partition strides for the generated WHERE clause expressions used to split the column `partitionColumn` evenly. This defaults to SparkContext.defaultParallelism when unset.

`predicates`: a list of conditions in the WHERE clause; each one defines one partition.
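
For example, a minimal sketch that reuses the host, database, table, and credentials from the question (adjust them for your environment). Extra named arguments are passed through `...` as JDBC connection properties:

    library(SparkR)
    sparkR.session()  # assumes the PostgreSQL driver jar is already on the classpath

    url <- "jdbc:postgresql://10.128.0.4:5432/sparkDB"

    # Read the table into a SparkDataFrame, which is usable directly from R.
    pg_df <- read.jdbc(url, "survey_results",
                       user = "prashant", password = "pandey",
                       driver = "org.postgresql.Driver")

    head(pg_df)  # inspect the first rows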

Data can be written back over JDBC using the following; again, a usage sketch follows the argument list:

write.jdbc(x, url, tableName, mode = "error", ...)

Arguments

`x`: a SparkDataFrame.

`url`: JDBC database url of the form jdbc:subprotocol:subname.

`tableName`: the name of the table in the external database.

`mode`: one of the save modes 'append', 'overwrite', 'error', or 'ignore' ('error' by default).

`...`: additional JDBC database connection properties.
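
For example, reusing the connection details from the read sketch above (the target table name `survey_results_copy` is made up for illustration):

    # Append the SparkDataFrame read earlier to a target table;
    # with mode = "error" (the default) the call fails if the table already exists.
    write.jdbc(pg_df, url, "survey_results_copy", mode = "append",
               user = "prashant", password = "pandey",
               driver = "org.postgresql.Driver")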

Note: the JDBC driver must be on the Spark classpath.
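
One way to do that is via the `sparkJars` argument when creating the session; a sketch, where the jar path is an assumption (point it at your local PostgreSQL driver jar):

    # Add the driver jar to the session's classpath at startup.
    sparkR.session(sparkJars = "/path/to/postgresql-42.2.6.jar")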

bob
  • Using JDBC gives me the following error: Error: Error in jdbc : java.sql.SQLException: No suitable driver ERROR RBackendHandler: jdbc on 16 failed java.lang.reflect.InvocationTargetException – Vasudha Jain Jul 29 '19 at 11:40
  • Add the PostgreSQL JDBC driver to the classpath. https://www.dataxone.com/import-export-postgresql-data-sparkr-dataframe/ – bob Jul 29 '19 at 11:46
  • After adding the path successfully, I get the following error when reading the data: `ERROR: operator does not exist: character varying = integer Hint: No operator matches the given name and argument type(s). You might need to add explicit type casts.` – Vasudha Jain Jul 30 '19 at 06:02
  • So any idea how we can define the type cast explicitly in predicates? (See the sketch after this thread.) – Vasudha Jain Jul 30 '19 at 06:38
  • @VasudhaJain: I guess you are trying to compare an integer with a varchar. Check this answer: https://stackoverflow.com/a/25358092/5019163 – bob Jul 30 '19 at 10:05
  • Thanks, Bob, I figured it out and sorted it. – Vasudha Jain Jul 31 '19 at 05:25
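
Regarding the cast question in the comment thread: each string in `predicates` is a raw SQL fragment that becomes one partition's WHERE condition, so an explicit cast can be written inline. A sketch, reusing `url` from above; the column name `respondent_id` is hypothetical:

    # Two predicates -> two partitions; the CAST avoids comparing a
    # varchar column against an integer literal on the PostgreSQL side.
    pg_df <- read.jdbc(url, "survey_results",
                       predicates = list("CAST(respondent_id AS integer) < 1000",
                                         "CAST(respondent_id AS integer) >= 1000"),
                       user = "prashant", password = "pandey",
                       driver = "org.postgresql.Driver")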