I wanted to import data from an Oracle database into our Hadoop HDFS and considered using Sqoop. When I tried, I discovered that the Data Connector for Oracle and Hadoop was disabled:
2019-07-18 09:19:58,203 [uber-SubtaskRunner] INFO org.apache.sqoop.manager.oracle.OraOopManagerFactory - Data Connector for Oracle and Hadoop is disabled.
I contacted the sysadmin, who let me know that our Sqoop installation is not configured for Oracle databases and won't be. Instead, they recommended the PySpark script below.
I ran the script on our CDSW instance and, with some configuration on my end, it worked really well.
import os
import netrc
from pyspark.sql import SparkSession
# Set up the PySpark dependencies: to connect to the Oracle DB
# via JDBC we need the JDBC driver jar provided by Oracle
ORACLE_JAR = "ojdbc7.jar"
JAR_LOC = os.path.join(os.environ["JARS_DIR"], ORACLE_JAR)
# Create a SparkSession
spark = SparkSession.builder \
    .appName("My Sample App") \
    .config("spark.jars", "local://" + JAR_LOC) \
    .getOrCreate()
# The Exadata host you are connecting to
db_host = "exadata.host.address"
# Read the credentials from ~/.netrc
user, account, password = netrc.netrc().authenticators(db_host)
# Check the Spark version and other config information
spark.sparkContext.getConf().getAll()
driver = "oracle.jdbc.OracleDriver"
# The SID of your database
sid = "mydb.sid.tns"
# Builds a thin-driver URL of the form jdbc:oracle:thin:@//host:1521/sid
url = "/".join(["jdbc:oracle:thin:@/", db_host + ":1521", sid])
# The query to run, wrapped in parentheses so Spark treats it as an inline view
dbtable = "(select * from table)"
jdbc_df = spark.read.format("jdbc").option("url", url) \
.option("driver", driver) \
.option("dbtable", dbtable) \
.option("user", user) \
.option("password", password).load()
My question is: what are the relative advantages and disadvantages of using this PySpark script over Sqoop or Flume?