Is there a way to limit the number of records fetched from a JDBC source using Spark SQL 2.2.0?
I am dealing with the task of moving (and transforming) a large number of records (>200M) from one MS SQL Server table to another:
val spark = SparkSession
  .builder()
  .appName("co.smith.copydata")
  .getOrCreate()

val sourceData = spark
  .read
  .format("jdbc")
  .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
  .option("url", jdbcSqlConnStr)
  .option("dbtable", sourceTableName)
  .load()
  .take(limit)
While this works, it clearly loads all 200M records from the database first, taking its sweet time (about 18 minutes), and only then returns the limited number of records I want for testing and development purposes.
Swapping take(...) and load() produces a compilation error.
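For illustration, what I am hoping is possible is pushing the limit down into SQL Server itself, for example by passing a subquery with TOP as the dbtable option. This is only a rough sketch of the idea (the alias limited_src and the string interpolation of limit and sourceTableName are my own guesses), and I do not know whether this is the right or recommended way:

val sourceData = spark
  .read
  .format("jdbc")
  .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
  .option("url", jdbcSqlConnStr)
  // guess: wrap the source table in a TOP subquery so only `limit` rows ever leave the database
  .option("dbtable", s"(SELECT TOP $limit * FROM $sourceTableName) AS limited_src")
  .load()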
I appreciate that there are other options: copying sample data to a smaller table, using SSIS, or using alternative ETL tools.
I am really curious whether there is a way to achieve my goal using Spark, SQL and JDBC.