I have two large spark data frames, user details and user Relationship. Both data frame has more than 20M records.
data frame join Join operation is very very slow. Please help me to improve join performance.
DF details
user_df
- One user should have multiple addresses.
- One user + address should have multiple networks
Schema
root
|-- USERIDENTIFIER: string (nullable = true)
|-- ADDRESSIDENTIFIER: string (nullable = true)
|-- LATITUDE: decimal(15,12) (nullable = true)
|-- LONGITUDE: decimal(15,12) (nullable = true)
|-- USERNAME: string (nullable = true)
|-- NETWORKIDENTIFIER: string (nullable = true)
sample data
+--------------------------------+--------------------------------+---------------+-----------------+--------------------------+--------------------------------+
|USERIDENTIFIER |ADDRESSIDENTIFIER |LATITUDE |LONGITUDE |USERNAME |NETWORKIDENTIFIER |
+--------------------------------+--------------------------------+---------------+-----------------+--------------------------+--------------------------------+
|C9BBB242202692B589DC5E6AD1040229|0B60DA9CB69084711BC119CB7DB5A120|33.779730000000|-117.867278000000|CHOC CHILDRENS SPECIALISTS|0364142E829F4B9384C8023C3BD7194B|
|C9BBB242202692B589DC5E6AD1040229|0EBFEB7F15B503D7F34BA4650E561D4B|33.804552000000|-118.067973000000|CHOC CHILDRENS SPECIALISTS|0364142E829F4B9384C8023C3BD7194B|
|C9BBB242202692B589DC5E6AD1040229|0DC22C71A345C6750158E88D98D6671D|33.701665000000|-117.956545000000|CHOC CHILDRENS SPECIALISTS|0364142E829F4B9384C8023C3BD7194B|
|C9BBB242202692B589DC5E6AD1040229|086E9420C60A7D037FB127727967337B|33.780334000000|-117.863353000000|CHOC CHILDRENS SPECIALISTS|0364142E829F4B9384C8023C3BD7194B|
|C9BBB242202692B589DC5E6AD1040229|0E30A48D4829E093E60C7026351DFA04|33.780334000000|-117.863353000000|CHOC CHILDRENS SPECIALISTS|0364142E829F4B9384C8023C3BD7194B|
|97B177D33281DF30AFC9924294D1973D|33CF41F3F7AC69029EDD664DF569AE41|33.610987000000|-117.712710000000|Mary A Wilkinson |0364142E829F4B9384C8023C3BD7194B|
|97B177D33281DF30AFC9924294D1973D|97612FEFFD5EA7664E566161CE9318EF|33.569658000000|-117.726847000000|Mary A Wilkinson |0364142E829F4B9384C8023C3BD7194B|
|97B177D33281DF30AFC9924294D1973D|0779CA3DCA30B801B55AB8FE8EFE8E77|33.665445000000|-117.761503000000|Mary A Wilkinson |0364142E829F4B9384C8023C3BD7194B|
|97B177D33281DF30AFC9924294D1973D|A8EFFB0D29B8628B9A3E993490FF6F8F|33.439137000000|-117.621570000000|Mary A Wilkinson |0364142E829F4B9384C8023C3BD7194B|
|97B177D33281DF30AFC9924294D1973D|0779CA3DCA30B801B55AB8FE8EFE8E77|33.665445000000|-117.761503000000|Mary A Wilkinson |06702F3EAF8846A450AB8A6DF93E8227|
+--------------------------------+--------------------------------+---------------+-----------------+--------------------------+--------------------------------+
user_relationship_df
- One useraddress can have relationship with another user(all addresses)
Schema
root
|-- PARENTUSERIDENTIFIER: string (nullable = true)
|-- PARENTADDRESSIDENTIFIER: string (nullable = true)
|-- CHILDUSERIDENTIFIER: string (nullable = true)
|-- NETWORKIDENTIFIER: string (nullable = true)
sample data
+--------------------------------+--------------------------------+--------------------------------+--------------------------------+
|PARENTUSERIDENTIFIER |PARENTADDRESSIDENTIFIER |CHILDUSERIDENTIFIER |NETWORKIDENTIFIER |
+--------------------------------+--------------------------------+--------------------------------+--------------------------------+
|97B177D33281DF30AFC9924294D1973D|A8EFFB0D29B8628B9A3E993490FF6F8F|C9BBB242202692B589DC5E6AD1040229|0364142E829F4B9384C8023C3BD7194B|
|97B177D33281DF30AFC9924294D1973D|33CF41F3F7AC69029EDD664DF569AE41|C9BBB242202692B589DC5E6AD1040229|0364142E829F4B9384C8023C3BD7194B|
|97B177D33281DF30AFC9924294D1973D|97612FEFFD5EA7664E566161CE9318EF|C9BBB242202692B589DC5E6AD1040229|0364142E829F4B9384C8023C3BD7194B|
|97B177D33281DF30AFC9924294D1973D|0779CA3DCA30B801B55AB8FE8EFE8E77|C9BBB242202692B589DC5E6AD1040229|0364142E829F4B9384C8023C3BD7194B|
|97B177D33281DF30AFC9924294D1973D|0779CA3DCA30B801B55AB8FE8EFE8E77|C9BBB242202692B589DC5E6AD1040229|06702F3EAF8846A450AB8A6DF93E8227|
+--------------------------------+--------------------------------+--------------------------------+--------------------------------+
I have to perform bellow join for getting all Parent child relationship details.
user_relationship_df = user_df.alias('U1').join(
user_relationship_df.alias('R'),
[
f_col('U1.UserIdentifier') == f_col('R.parentUserIdentifier'),
f_col('U1.addressIdentifier') == f_col('R.parentAddressIdentifier'),
f_col('U1.networkIdentifier') == f_col('R.networkIdentifier'),
],
'inner'
).join(
user_df.alias('U2'),
[
f_col('U2.UserIdentifier') == f_col('R.childUserIdentifier'),
f_col('U2.networkIdentifier') == f_col('R.networkIdentifier')
]
).select(
f_col('U1.UserIdentifier').alias('parentUserIdentifier'),
f_col('U1.AddressIdentifier').alias('parentAddressIdentifier'),
f_col('U2.UserName').alias('parentUserName'),
f_col('U1.latitude').alias('parentlatitude'),
f_col('U1.longitude').alias('parentlongitude'),
f_col('U2.UserIdentifier').alias('childUserIdentifier'),
f_col('U2.AddressIdentifier').alias('childadressIdentifier'),
f_col('U2.latitude').alias('childlatitude'),
f_col('U2.longitude').alias('childlongitude'),
f_col('U2.UserName').alias('childUserName'),
f_col('R.networkIdentifier').alias('networkIdentifier')
)
Above join operation is very slow. How can I improve the performance ?
bellow one is the write operation after join.
expected output after join
+--------------------+-----------------------+--------------------+---------------+-----------------+--------------------+---------------------+---------------+-----------------+--------------------+--------------------+
|parentUserIdentifier|parentAddressIdentifier| parentUserName| parentlatitude| parentlongitude| childUserIdentifier|childadressIdentifier| childlatitude| childlongitude| childUserName| networkIdentifier|
+--------------------+-----------------------+--------------------+---------------+-----------------+--------------------+---------------------+---------------+-----------------+--------------------+--------------------+
|97B177D33281DF30A...| 33CF41F3F7AC69029...|CHOC CHILDRENS SP...|33.610987000000|-117.712710000000|C9BBB242202692B58...| 0E30A48D4829E093E...|33.780334000000|-117.863353000000|CHOC CHILDRENS SP...|0364142E829F4B938...|
|97B177D33281DF30A...| 33CF41F3F7AC69029...|CHOC CHILDRENS SP...|33.610987000000|-117.712710000000|C9BBB242202692B58...| 086E9420C60A7D037...|33.780334000000|-117.863353000000|CHOC CHILDRENS SP...|0364142E829F4B938...|
|97B177D33281DF30A...| 33CF41F3F7AC69029...|CHOC CHILDRENS SP...|33.610987000000|-117.712710000000|C9BBB242202692B58...| 0DC22C71A345C6750...|33.701665000000|-117.956545000000|CHOC CHILDRENS SP...|0364142E829F4B938...|
|97B177D33281DF30A...| 33CF41F3F7AC69029...|CHOC CHILDRENS SP...|33.610987000000|-117.712710000000|C9BBB242202692B58...| 0EBFEB7F15B503D7F...|33.804552000000|-118.067973000000|CHOC CHILDRENS SP...|0364142E829F4B938...|
|97B177D33281DF30A...| 33CF41F3F7AC69029...|CHOC CHILDRENS SP...|33.610987000000|-117.712710000000|C9BBB242202692B58...| 0B60DA9CB69084711...|33.779730000000|-117.867278000000|CHOC CHILDRENS SP...|0364142E829F4B938...|
|97B177D33281DF30A...| 97612FEFFD5EA7664...|CHOC CHILDRENS SP...|33.569658000000|-117.726847000000|C9BBB242202692B58...| 0E30A48D4829E093E...|33.780334000000|-117.863353000000|CHOC CHILDRENS SP...|0364142E829F4B938...|
|97B177D33281DF30A...| 97612FEFFD5EA7664...|CHOC CHILDRENS SP...|33.569658000000|-117.726847000000|C9BBB242202692B58...| 086E9420C60A7D037...|33.780334000000|-117.863353000000|CHOC CHILDRENS SP...|0364142E829F4B938...|
|97B177D33281DF30A...| 97612FEFFD5EA7664...|CHOC CHILDRENS SP...|33.569658000000|-117.726847000000|C9BBB242202692B58...| 0DC22C71A345C6750...|33.701665000000|-117.956545000000|CHOC CHILDRENS SP...|0364142E829F4B938...|
|97B177D33281DF30A...| 97612FEFFD5EA7664...|CHOC CHILDRENS SP...|33.569658000000|-117.726847000000|C9BBB242202692B58...| 0EBFEB7F15B503D7F...|33.804552000000|-118.067973000000|CHOC CHILDRENS SP...|0364142E829F4B938...|
|97B177D33281DF30A...| 97612FEFFD5EA7664...|CHOC CHILDRENS SP...|33.569658000000|-117.726847000000|C9BBB242202692B58...| 0B60DA9CB69084711...|33.779730000000|-117.867278000000|CHOC CHILDRENS SP...|0364142E829F4B938...|
|97B177D33281DF30A...| 0779CA3DCA30B801B...|CHOC CHILDRENS SP...|33.665445000000|-117.761503000000|C9BBB242202692B58...| 0E30A48D4829E093E...|33.780334000000|-117.863353000000|CHOC CHILDRENS SP...|0364142E829F4B938...|
|97B177D33281DF30A...| 0779CA3DCA30B801B...|CHOC CHILDRENS SP...|33.665445000000|-117.761503000000|C9BBB242202692B58...| 086E9420C60A7D037...|33.780334000000|-117.863353000000|CHOC CHILDRENS SP...|0364142E829F4B938...|
|97B177D33281DF30A...| 0779CA3DCA30B801B...|CHOC CHILDRENS SP...|33.665445000000|-117.761503000000|C9BBB242202692B58...| 0DC22C71A345C6750...|33.701665000000|-117.956545000000|CHOC CHILDRENS SP...|0364142E829F4B938...|
|97B177D33281DF30A...| 0779CA3DCA30B801B...|CHOC CHILDRENS SP...|33.665445000000|-117.761503000000|C9BBB242202692B58...| 0EBFEB7F15B503D7F...|33.804552000000|-118.067973000000|CHOC CHILDRENS SP...|0364142E829F4B938...|
|97B177D33281DF30A...| 0779CA3DCA30B801B...|CHOC CHILDRENS SP...|33.665445000000|-117.761503000000|C9BBB242202692B58...| 0B60DA9CB69084711...|33.779730000000|-117.867278000000|CHOC CHILDRENS SP...|0364142E829F4B938...|
|97B177D33281DF30A...| A8EFFB0D29B8628B9...|CHOC CHILDRENS SP...|33.439137000000|-117.621570000000|C9BBB242202692B58...| 0E30A48D4829E093E...|33.780334000000|-117.863353000000|CHOC CHILDRENS SP...|0364142E829F4B938...|
|97B177D33281DF30A...| A8EFFB0D29B8628B9...|CHOC CHILDRENS SP...|33.439137000000|-117.621570000000|C9BBB242202692B58...| 086E9420C60A7D037...|33.780334000000|-117.863353000000|CHOC CHILDRENS SP...|0364142E829F4B938...|
|97B177D33281DF30A...| A8EFFB0D29B8628B9...|CHOC CHILDRENS SP...|33.439137000000|-117.621570000000|C9BBB242202692B58...| 0DC22C71A345C6750...|33.701665000000|-117.956545000000|CHOC CHILDRENS SP...|0364142E829F4B938...|
|97B177D33281DF30A...| A8EFFB0D29B8628B9...|CHOC CHILDRENS SP...|33.439137000000|-117.621570000000|C9BBB242202692B58...| 0EBFEB7F15B503D7F...|33.804552000000|-118.067973000000|CHOC CHILDRENS SP...|0364142E829F4B938...|
|97B177D33281DF30A...| A8EFFB0D29B8628B9...|CHOC CHILDRENS SP...|33.439137000000|-117.621570000000|C9BBB242202692B58...| 0B60DA9CB69084711...|33.779730000000|-117.867278000000|CHOC CHILDRENS SP...|0364142E829F4B938...|
+--------------------+-----------------------+--------------------+---------------+-----------------+--------------------+---------------------+---------------+-----------------+--------------------+--------------------+
user_relationship_details_df.coalesce(1000).write.option(
'maxRecordsPerFile', 100000).mode('overwrite').format('json').save('hdfs_path')
pyspark command used to execute.
spark-submit script.py