4

I am running a Spark job and I need to read from an HDFS table which is in, let's say, HadoopCluster-1. I then want to write the aggregated dataframe into a table which is present in another cluster, HadoopCluster-2. What would be the best way to do this?

  1. I am thinking of the below approach: before writing the data to the target table, read the target cluster's hdfs-site.xml and core-site.xml using addResource, copy all the config values into a Map<String,String>, and then set these conf values on my dataset.sparkSession().sparkContext().hadoopConfiguration(). A rough sketch of what I mean is below.

Is this a good way to achieve my goal?
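Roughly, this is what I have in mind (an untested sketch in Scala; the hdfs-site.xml/core-site.xml paths are placeholders and `df` stands for the dataframe already read from cluster 1):

    // Load the client configs of cluster 2 and copy every key/value pair
    // into the Hadoop configuration used by the current SparkContext.
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import scala.collection.JavaConverters._

    val cluster2Conf = new Configuration(false)
    cluster2Conf.addResource(new Path("/path/to/cluster2/core-site.xml"))
    cluster2Conf.addResource(new Path("/path/to/cluster2/hdfs-site.xml"))

    val hadoopConf = df.sparkSession.sparkContext.hadoopConfiguration
    cluster2Conf.iterator().asScala.foreach { entry =>
      hadoopConf.set(entry.getKey, entry.getValue)
    }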

GearFour
  • You can read and write between clusters by using fully qualified `hdfs://cluster-X:port/path` URIs – OneCricketeer Aug 02 '21 at 17:16
  • @OneCricketeer I was using the same approach but I am getting the below error: `java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient` – GearFour Aug 05 '21 at 11:07
  • Sounds like a Hive error, not related to reading/writing HDFS data directly – OneCricketeer Aug 05 '21 at 15:01
  • Yes. I was able to get rid of the hive dependency. Thanks. – GearFour Aug 06 '21 at 05:47
  • @OneCricketeer As soon as my code executes `df.write().mode(SaveMode.Append).save("hdfs://cluster-2:port/path")`, I get the below error: `java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via: [TOKEN, KERBEROS]; Host details: local host is "cluster-1/10.20.30.40"; destination host is "cluster-2":port`. I have checked and all the Kerberos settings are correct. The UGI has been set up correctly. The same user has access to both clusters. Any suggestions on this? – GearFour Aug 10 '21 at 10:41
  • I've not used Spark/Hadoop in a Kerberos environment, but AFAIK, only one HDFS client can be authenticated at a time. `Distcp` might be a better option, but not really sure – OneCricketeer Aug 10 '21 at 12:27

2 Answers

2

If you want to read a Hive table from cluster1 as a dataframe and write it back as a Hive table in cluster2 after transforming the dataframe, you can try the below approach.

  1. Make sure HiveServer2 and the metastore are running on both clusters. The commands to start them are:

hive --service hiveserver2

hive --service metastore

  2. Make sure Hive is properly configured with a username/password. You can leave both the username and password empty, but then you will get an error; you can resolve that by referring to this link.

  3. Now read the Hive table from cluster1 as a Spark dataframe and, after transformation, write it to the Hive table in cluster2.

    // spark-scala code
    
    val sourceJdbcMap = Map(
     "url"->"jdbc:hive2://<source_host>:<port>", //default port is 10000
     "driver"->"org.apache.hive.jdbc.HiveDriver",
     "user"->"<username>",
     "password"->"<password>",
     "dbtable"->"<source_table>")
    
    val targetJdbcMap = Map(
     "url"->"jdbc:hive2://<target_host>:<port>", //default port is 10000
     "driver"->"org.apache.hive.jdbc.HiveDriver",
     "user"->"<username>",
     "password"->"<password>",
     "dbtable"->"<target_table>")
    
    val sourceDF = spark.read.format("jdbc").options(sourceJdbcMap).load()
    
    val transformedDF = sourceDF // transformation goes here...
    
    transformedDF.write.options(targetJdbcMap).format("jdbc").save()
    
Mohana B C
  • Thank you for your solution, but my requirement is a little different. I do not want to write to any Hive table; I just want to write files to an HDFS location in a different cluster. – GearFour Aug 10 '21 at 10:42
1

I was able to read from an HDFS location on one HA-enabled Hadoop cluster and write to an HDFS location on another HA-enabled Hadoop cluster using Spark by following the steps below:

1) Check whether the KDCs of the two clusters are in the same realm or in different realms. If they are in the same realm, skip this step; otherwise set up cross-realm authentication between the two KDCs. One might follow: https://community.cloudera.com/t5/Community-Articles/Setup-cross-realm-trust-between-two-MIT-KDC/ta-p/247026

Scenario-1: This is a recurring read-and-write operation

2) Edit the hdfs-site.xml of the source cluster so that it also knows about the target cluster's nameservice, as per the steps mentioned in: https://docs.cloudera.com/HDPDocuments/HDP2/HDP-2.6.4/bk_administration/content/distcp_between_ha_clusters.html (a rough example is sketched below).
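For illustration, the added section of the source cluster's hdfs-site.xml would look roughly like this (the nameservice name `targetCluster`, the second namenode host `targetCluster-02.xyz.com`, and the ports are placeholders from my setup; adjust them to yours):

    <!-- Make the source cluster aware of the target cluster's HA nameservice -->
    <property>
      <name>dfs.nameservices</name>
      <value>sourceCluster,targetCluster</value>
    </property>
    <property>
      <name>dfs.ha.namenodes.targetCluster</name>
      <value>nn1,nn2</value>
    </property>
    <property>
      <name>dfs.namenode.rpc-address.targetCluster.nn1</name>
      <value>targetCluster-01.xyz.com:8020</value>
    </property>
    <property>
      <name>dfs.namenode.rpc-address.targetCluster.nn2</name>
      <value>targetCluster-02.xyz.com:8020</value>
    </property>
    <property>
      <name>dfs.client.failover.proxy.provider.targetCluster</name>
      <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
    </property>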

3) Add the below property to the Spark conf at the time of application launch: spark.kerberos.access.hadoopFileSystems=hdfs://targetCluster-01.xyz.com:8020. Basically, the value should be the InetSocketAddress of the active namenode. A spark-submit example is shown below.
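For example, when launching with spark-submit (the class and jar names here are just placeholders):

    spark-submit \
      --master yarn \
      --conf spark.kerberos.access.hadoopFileSystems=hdfs://targetCluster-01.xyz.com:8020 \
      --class com.example.CrossClusterJob \
      my-job.jar

If I remember correctly, on older Spark 2.x versions the equivalent setting is spark.yarn.access.hadoopFileSystems.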

4) In your code, give the absolute path of your target HDFS location. For example: df.write.mode(SaveMode.Append).save("hdfs://targetCluster-01.xyz.com/usr/tmp/targetFolder")

Note: In step 4 you can also provide a logical path like hdfs://targetCluster/usr/tmp/targetFolder since we have added the target nameservice to our hdfs-site.xml.

Scenario-2: This is an ad-hoc request where you need to perform this read-and-write operation only once

Skip step #2 mentioned above.

Follow step #3 and step #4 as they are.

PS: The user of the job should have access to both the clusters for this to work.

GearFour