
I have two large Spark data frames, user details and user relationship. Both data frames have more than 20M records.

The data frame join operation is very slow. Please help me improve the join performance.

DF details

user_df

  • One user can have multiple addresses.
  • One user + address combination can belong to multiple networks.

Schema

root
 |-- USERIDENTIFIER: string (nullable = true)
 |-- ADDRESSIDENTIFIER: string (nullable = true)
 |-- LATITUDE: decimal(15,12) (nullable = true)
 |-- LONGITUDE: decimal(15,12) (nullable = true)
 |-- USERNAME: string (nullable = true)
 |-- NETWORKIDENTIFIER: string (nullable = true)

sample data

+--------------------------------+--------------------------------+---------------+-----------------+--------------------------+--------------------------------+
|USERIDENTIFIER                  |ADDRESSIDENTIFIER               |LATITUDE       |LONGITUDE        |USERNAME                  |NETWORKIDENTIFIER               |
+--------------------------------+--------------------------------+---------------+-----------------+--------------------------+--------------------------------+
|C9BBB242202692B589DC5E6AD1040229|0B60DA9CB69084711BC119CB7DB5A120|33.779730000000|-117.867278000000|CHOC CHILDRENS SPECIALISTS|0364142E829F4B9384C8023C3BD7194B|
|C9BBB242202692B589DC5E6AD1040229|0EBFEB7F15B503D7F34BA4650E561D4B|33.804552000000|-118.067973000000|CHOC CHILDRENS SPECIALISTS|0364142E829F4B9384C8023C3BD7194B|
|C9BBB242202692B589DC5E6AD1040229|0DC22C71A345C6750158E88D98D6671D|33.701665000000|-117.956545000000|CHOC CHILDRENS SPECIALISTS|0364142E829F4B9384C8023C3BD7194B|
|C9BBB242202692B589DC5E6AD1040229|086E9420C60A7D037FB127727967337B|33.780334000000|-117.863353000000|CHOC CHILDRENS SPECIALISTS|0364142E829F4B9384C8023C3BD7194B|
|C9BBB242202692B589DC5E6AD1040229|0E30A48D4829E093E60C7026351DFA04|33.780334000000|-117.863353000000|CHOC CHILDRENS SPECIALISTS|0364142E829F4B9384C8023C3BD7194B|
|97B177D33281DF30AFC9924294D1973D|33CF41F3F7AC69029EDD664DF569AE41|33.610987000000|-117.712710000000|Mary A Wilkinson          |0364142E829F4B9384C8023C3BD7194B|
|97B177D33281DF30AFC9924294D1973D|97612FEFFD5EA7664E566161CE9318EF|33.569658000000|-117.726847000000|Mary A Wilkinson          |0364142E829F4B9384C8023C3BD7194B|
|97B177D33281DF30AFC9924294D1973D|0779CA3DCA30B801B55AB8FE8EFE8E77|33.665445000000|-117.761503000000|Mary A Wilkinson          |0364142E829F4B9384C8023C3BD7194B|
|97B177D33281DF30AFC9924294D1973D|A8EFFB0D29B8628B9A3E993490FF6F8F|33.439137000000|-117.621570000000|Mary A Wilkinson          |0364142E829F4B9384C8023C3BD7194B|
|97B177D33281DF30AFC9924294D1973D|0779CA3DCA30B801B55AB8FE8EFE8E77|33.665445000000|-117.761503000000|Mary A Wilkinson          |06702F3EAF8846A450AB8A6DF93E8227|
+--------------------------------+--------------------------------+---------------+-----------------+--------------------------+--------------------------------+

user_relationship_df

  • One user address can have a relationship with another user (across all of that user's addresses).

Schema

root
 |-- PARENTUSERIDENTIFIER: string (nullable = true)
 |-- PARENTADDRESSIDENTIFIER: string (nullable = true)
 |-- CHILDUSERIDENTIFIER: string (nullable = true)
 |-- NETWORKIDENTIFIER: string (nullable = true)

sample data

+--------------------------------+--------------------------------+--------------------------------+--------------------------------+
|PARENTUSERIDENTIFIER            |PARENTADDRESSIDENTIFIER         |CHILDUSERIDENTIFIER             |NETWORKIDENTIFIER               |
+--------------------------------+--------------------------------+--------------------------------+--------------------------------+
|97B177D33281DF30AFC9924294D1973D|A8EFFB0D29B8628B9A3E993490FF6F8F|C9BBB242202692B589DC5E6AD1040229|0364142E829F4B9384C8023C3BD7194B|
|97B177D33281DF30AFC9924294D1973D|33CF41F3F7AC69029EDD664DF569AE41|C9BBB242202692B589DC5E6AD1040229|0364142E829F4B9384C8023C3BD7194B|
|97B177D33281DF30AFC9924294D1973D|97612FEFFD5EA7664E566161CE9318EF|C9BBB242202692B589DC5E6AD1040229|0364142E829F4B9384C8023C3BD7194B|
|97B177D33281DF30AFC9924294D1973D|0779CA3DCA30B801B55AB8FE8EFE8E77|C9BBB242202692B589DC5E6AD1040229|0364142E829F4B9384C8023C3BD7194B|
|97B177D33281DF30AFC9924294D1973D|0779CA3DCA30B801B55AB8FE8EFE8E77|C9BBB242202692B589DC5E6AD1040229|06702F3EAF8846A450AB8A6DF93E8227|
+--------------------------------+--------------------------------+--------------------------------+--------------------------------+

I have to perform the join below to get all parent-child relationship details.

from pyspark.sql.functions import col as f_col

user_relationship_details_df = user_df.alias('U1').join(
        user_relationship_df.alias('R'),
        [
            f_col('U1.UserIdentifier') == f_col('R.parentUserIdentifier'),
            f_col('U1.addressIdentifier') == f_col('R.parentAddressIdentifier'),
            f_col('U1.networkIdentifier') == f_col('R.networkIdentifier'),
        ],
        'inner'
    ).join(
        user_df.alias('U2'),
        [
            f_col('U2.UserIdentifier') == f_col('R.childUserIdentifier'),
            f_col('U2.networkIdentifier') == f_col('R.networkIdentifier')
        ]
    ).select(
        f_col('U1.UserIdentifier').alias('parentUserIdentifier'),
        f_col('U1.AddressIdentifier').alias('parentAddressIdentifier'),
        f_col('U1.UserName').alias('parentUserName'),  # parent name comes from U1, not U2
        f_col('U1.latitude').alias('parentlatitude'),
        f_col('U1.longitude').alias('parentlongitude'),
        f_col('U2.UserIdentifier').alias('childUserIdentifier'),
        f_col('U2.AddressIdentifier').alias('childadressIdentifier'),
        f_col('U2.latitude').alias('childlatitude'),
        f_col('U2.longitude').alias('childlongitude'),
        f_col('U2.UserName').alias('childUserName'),
        f_col('R.networkIdentifier').alias('networkIdentifier')
    )

The above join operation is very slow. How can I improve its performance?

Expected output after the join:

+--------------------+-----------------------+--------------------+---------------+-----------------+--------------------+---------------------+---------------+-----------------+--------------------+--------------------+
|parentUserIdentifier|parentAddressIdentifier|      parentUserName| parentlatitude|  parentlongitude| childUserIdentifier|childadressIdentifier|  childlatitude|   childlongitude|       childUserName|   networkIdentifier|
+--------------------+-----------------------+--------------------+---------------+-----------------+--------------------+---------------------+---------------+-----------------+--------------------+--------------------+
|97B177D33281DF30A...|   33CF41F3F7AC69029...|CHOC CHILDRENS SP...|33.610987000000|-117.712710000000|C9BBB242202692B58...| 0E30A48D4829E093E...|33.780334000000|-117.863353000000|CHOC CHILDRENS SP...|0364142E829F4B938...|
|97B177D33281DF30A...|   33CF41F3F7AC69029...|CHOC CHILDRENS SP...|33.610987000000|-117.712710000000|C9BBB242202692B58...| 086E9420C60A7D037...|33.780334000000|-117.863353000000|CHOC CHILDRENS SP...|0364142E829F4B938...|
|97B177D33281DF30A...|   33CF41F3F7AC69029...|CHOC CHILDRENS SP...|33.610987000000|-117.712710000000|C9BBB242202692B58...| 0DC22C71A345C6750...|33.701665000000|-117.956545000000|CHOC CHILDRENS SP...|0364142E829F4B938...|
|97B177D33281DF30A...|   33CF41F3F7AC69029...|CHOC CHILDRENS SP...|33.610987000000|-117.712710000000|C9BBB242202692B58...| 0EBFEB7F15B503D7F...|33.804552000000|-118.067973000000|CHOC CHILDRENS SP...|0364142E829F4B938...|
|97B177D33281DF30A...|   33CF41F3F7AC69029...|CHOC CHILDRENS SP...|33.610987000000|-117.712710000000|C9BBB242202692B58...| 0B60DA9CB69084711...|33.779730000000|-117.867278000000|CHOC CHILDRENS SP...|0364142E829F4B938...|
|97B177D33281DF30A...|   97612FEFFD5EA7664...|CHOC CHILDRENS SP...|33.569658000000|-117.726847000000|C9BBB242202692B58...| 0E30A48D4829E093E...|33.780334000000|-117.863353000000|CHOC CHILDRENS SP...|0364142E829F4B938...|
|97B177D33281DF30A...|   97612FEFFD5EA7664...|CHOC CHILDRENS SP...|33.569658000000|-117.726847000000|C9BBB242202692B58...| 086E9420C60A7D037...|33.780334000000|-117.863353000000|CHOC CHILDRENS SP...|0364142E829F4B938...|
|97B177D33281DF30A...|   97612FEFFD5EA7664...|CHOC CHILDRENS SP...|33.569658000000|-117.726847000000|C9BBB242202692B58...| 0DC22C71A345C6750...|33.701665000000|-117.956545000000|CHOC CHILDRENS SP...|0364142E829F4B938...|
|97B177D33281DF30A...|   97612FEFFD5EA7664...|CHOC CHILDRENS SP...|33.569658000000|-117.726847000000|C9BBB242202692B58...| 0EBFEB7F15B503D7F...|33.804552000000|-118.067973000000|CHOC CHILDRENS SP...|0364142E829F4B938...|
|97B177D33281DF30A...|   97612FEFFD5EA7664...|CHOC CHILDRENS SP...|33.569658000000|-117.726847000000|C9BBB242202692B58...| 0B60DA9CB69084711...|33.779730000000|-117.867278000000|CHOC CHILDRENS SP...|0364142E829F4B938...|
|97B177D33281DF30A...|   0779CA3DCA30B801B...|CHOC CHILDRENS SP...|33.665445000000|-117.761503000000|C9BBB242202692B58...| 0E30A48D4829E093E...|33.780334000000|-117.863353000000|CHOC CHILDRENS SP...|0364142E829F4B938...|
|97B177D33281DF30A...|   0779CA3DCA30B801B...|CHOC CHILDRENS SP...|33.665445000000|-117.761503000000|C9BBB242202692B58...| 086E9420C60A7D037...|33.780334000000|-117.863353000000|CHOC CHILDRENS SP...|0364142E829F4B938...|
|97B177D33281DF30A...|   0779CA3DCA30B801B...|CHOC CHILDRENS SP...|33.665445000000|-117.761503000000|C9BBB242202692B58...| 0DC22C71A345C6750...|33.701665000000|-117.956545000000|CHOC CHILDRENS SP...|0364142E829F4B938...|
|97B177D33281DF30A...|   0779CA3DCA30B801B...|CHOC CHILDRENS SP...|33.665445000000|-117.761503000000|C9BBB242202692B58...| 0EBFEB7F15B503D7F...|33.804552000000|-118.067973000000|CHOC CHILDRENS SP...|0364142E829F4B938...|
|97B177D33281DF30A...|   0779CA3DCA30B801B...|CHOC CHILDRENS SP...|33.665445000000|-117.761503000000|C9BBB242202692B58...| 0B60DA9CB69084711...|33.779730000000|-117.867278000000|CHOC CHILDRENS SP...|0364142E829F4B938...|
|97B177D33281DF30A...|   A8EFFB0D29B8628B9...|CHOC CHILDRENS SP...|33.439137000000|-117.621570000000|C9BBB242202692B58...| 0E30A48D4829E093E...|33.780334000000|-117.863353000000|CHOC CHILDRENS SP...|0364142E829F4B938...|
|97B177D33281DF30A...|   A8EFFB0D29B8628B9...|CHOC CHILDRENS SP...|33.439137000000|-117.621570000000|C9BBB242202692B58...| 086E9420C60A7D037...|33.780334000000|-117.863353000000|CHOC CHILDRENS SP...|0364142E829F4B938...|
|97B177D33281DF30A...|   A8EFFB0D29B8628B9...|CHOC CHILDRENS SP...|33.439137000000|-117.621570000000|C9BBB242202692B58...| 0DC22C71A345C6750...|33.701665000000|-117.956545000000|CHOC CHILDRENS SP...|0364142E829F4B938...|
|97B177D33281DF30A...|   A8EFFB0D29B8628B9...|CHOC CHILDRENS SP...|33.439137000000|-117.621570000000|C9BBB242202692B58...| 0EBFEB7F15B503D7F...|33.804552000000|-118.067973000000|CHOC CHILDRENS SP...|0364142E829F4B938...|
|97B177D33281DF30A...|   A8EFFB0D29B8628B9...|CHOC CHILDRENS SP...|33.439137000000|-117.621570000000|C9BBB242202692B58...| 0B60DA9CB69084711...|33.779730000000|-117.867278000000|CHOC CHILDRENS SP...|0364142E829F4B938...|
+--------------------+-----------------------+--------------------+---------------+-----------------+--------------------+---------------------+---------------+-----------------+--------------------+--------------------+
Below is the write operation after the join:

user_relationship_details_df.coalesce(1000).write \
    .option('maxRecordsPerFile', 100000) \
    .mode('overwrite').format('json').save('hdfs_path')

The spark-submit command used to execute the script:

spark-submit script.py
SuFi

2 Answers


I can suggest two things to try:

  1. Increase the number of partitions in coalesce() to more than one to enable parallelism, which will improve the performance of the write operation.
  2. Use partitionBy() with a suitable partition column, which will also improve the performance of the write operation.

If the issue is with write performance, then use repartition() instead of coalesce(). coalesce() is meant to decrease the number of partitions, whereas repartition() can be used to increase the number of partitions and thereby increase parallelism. Increasing the repartition count to a very large value won't really help unless you have enough resources. More details about coalesce vs repartition

Felix K Jose
  • The joining is causing the slowness, not the write operation. – SuFi Jul 04 '21 at 14:52
  • Here are some guidelines for optimizing joins: https://databricks.com/session/optimizing-apache-spark-sql-joins Also, if one of your tables is small, you could do a **broadcast** join. – Felix K Jose Jul 05 '21 at 05:00