I am quite new to PySpark (and Spark in general). I am trying to connect Spark to a MySQL instance running on RDS. When I load a table as shown below, does Spark pull the entire table into memory?
from pyspark.sql import SparkSession
spark = SparkSession.builder.config("spark.jars", "/usr/share/java/mysql-connector-java-8.0.33.jar") \
.master("spark://spark-master:7077") \
.appName("app_name") \
.getOrCreate()
table_1_df = spark.read.format("jdbc").option("url", "jdbc:mysql://mysql:3306/some_db") \
.option("driver", "com.mysql.jdbc.Driver") \
.option("dbtable", "table1") \
.option("user", "user") \
.option("password", "pass") \
.load()
print(table_1_df.head())
If yes, is there a way to limit it, for example by asking Spark to load only the rows that match a condition? I would like to know if it's possible to restrict the fetch by (say) a primary key. Any input would be helpful. Thank you.
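For context, this is roughly what I was considering trying, based on the `dbtable` and `query` options I saw in the JDBC data source docs. I am not sure whether either form actually pushes the WHERE clause down to MySQL instead of scanning the whole table (`id` here is just a hypothetical primary key column):

# Hypothetical sketch: restrict the JDBC read with a condition.
# Option 1: wrap a subquery in dbtable (MySQL requires the alias "t").
limited_df = spark.read.format("jdbc") \
    .option("url", "jdbc:mysql://mysql:3306/some_db") \
    .option("driver", "com.mysql.cj.jdbc.Driver") \
    .option("dbtable", "(SELECT * FROM table1 WHERE id <= 1000) AS t") \
    .option("user", "user") \
    .option("password", "pass") \
    .load()

# Option 2 (Spark 2.4+): pass the SQL directly via the query option
# instead of dbtable:
# .option("query", "SELECT * FROM table1 WHERE id <= 1000")

Would either of these make MySQL do the filtering, or would a plain .filter("id <= 1000") on the DataFrame be pushed down anyway?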