I am importing a Postgres database into Spark. I know that I can partition on import, but that requires a numeric partition column (and I don't want to use the value column, because its values are all over the place and don't maintain order):
df = spark.read.jdbc(url=url, table='tableName', properties=properties)
df.printSchema()
root
 |-- id: string (nullable = false)
 |-- timestamp: timestamp (nullable = false)
 |-- key: string (nullable = false)
 |-- value: double (nullable = false)
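For reference, the partition-on-import route I mean looks roughly like this (value is the only numeric column in the schema, and the bounds below are placeholders, not real values from my table):

df_partitioned = spark.read.jdbc(
    url=url,
    table='tableName',
    column='value',      # partition column; must be numeric
    lowerBound=0,        # placeholder bound, not a real value from my table
    upperBound=100,      # placeholder bound, not a real value from my table
    numPartitions=20,
    properties=properties,
)

This splits the read into 20 range queries over value, which is exactly what I want to avoid.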
Instead, I am converting the DataFrame into an RDD of enumerated tuples and trying to partition that:
rdd = df.rdd.flatMap(lambda x: enumerate(x)).partitionBy(20)
Note that I used 20 because I have 5 workers with 4 cores each in my cluster, and 5*4=20.
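As a sanity check on how the records end up distributed, I can inspect the partitions like this (just a diagnostic, not part of the job itself):

rdd.getNumPartitions()          # expect 20
rdd.glom().map(len).collect()   # record count per partition

The second call forces a full evaluation, so it is only useful as a debugging aid.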
Unfortunately, the following command still takes forever to execute:
result = rdd.first()
Therefore I am wondering whether my logic above makes sense. Am I doing anything wrong? From the web GUI, it looks like the workers are not being used.