
I am relatively new to the Hadoop ecosystem. My goal is to read Hive tables using Apache Spark and process them. Hive is running in an EC2 instance, whereas Spark is running on my local machine.

To do a prototype, I've installed Apache Hadoop by following the steps present over here. I've added the required environment variables as well. I've started DFS using $HADOOP_HOME/sbin/start-dfs.sh

I've installed Apache Hive by following the steps present over here. I've started HiveServer2 and the Hive metastore. I've configured an Apache Derby DB (server mode) in Hive. I've created a sample table 'web_log' and added a few rows to it using beeline.

I've added the below in Hadoop's core-site.xml:

  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>

And added the below in hdfs-site.xml, so that clients connect to DataNodes by hostname rather than by their internal IP:

  <property>
    <name>dfs.client.use.datanode.hostname</name>
    <value>true</value>
  </property>

I've added core-site.xml, hdfs-site.xml and hive-site.xml to $SPARK_HOME/conf in my local Spark instance.

The core-site.xml and hdfs-site.xml copies there are empty, i.e.:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
</configuration>
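
Since both files are empty, nothing on the Spark side overrides fs.defaultFS. As a quick sanity check (a sketch, not part of the original setup), the value spark-shell actually resolves can be printed from the Hadoop configuration it uses:

scala> sc.hadoopConfiguration.get("fs.defaultFS")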

hive-site.xml has the below content:

<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://ec2-instance-external-dbs-name:9083</value>
    <description>URI for client to contact metastore server</description>
  </property>
</configuration>

I've started spark-shell and executed the following command:

scala> sqlContext
res0: org.apache.spark.sql.SQLContext = org.apache.spark.sql.hive.HiveContext@57d0c779

It seems Spark has created a HiveContext. I've executed SQL using the below command:

scala> val df = sqlContext.sql("select * from web_log")
df: org.apache.spark.sql.DataFrame = [viewtime: int, userid: bigint, url: string, referrer: string, ip: string]

The columns and their types match the sample table 'web_log' that I've created. Now when I execute scala> df.show, it takes some time and then throws the below error:

16/11/21 18:46:17 WARN BlockReaderFactory: I/O error constructing remote block reader.
org.apache.hadoop.net.ConnectTimeoutException: 60000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=/ec2-instance-private-ip:50010]
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:533)
    at org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:3101)
    at org.apache.hadoop.hdfs.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:755)
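
A metadata-only call would only hit the metastore here, which is presumably why the DataFrame's schema came back fine while df.show, which has to stream blocks from a DataNode on port 50010, times out. A quick way to see the split:

scala> df.printSchema()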

It seems the DFSClient is using the EC2 instance's internal IP. And as far as I know, I didn't start any application on port 50010 (the default HDFS DataNode data-transfer port).

Do I need to install and start any other application?

How can I make sure that the DFSClient uses the EC2 instance's external IP or external DNS name?

Is it possible to access Hive from an external Spark instance?


1 Answer


Add the below code snippet to the program you are running:

hiveContext.getAllConfs.mkString("\n")

This will print which Hive metastore it is connecting to, so you can review all the properties and spot any that are not correct.

If they are not what you are looking for and you can't adjust them due to some limitation, then, as described in the link, you can try pointing to the correct URIs like this:

hiveContext.setConf("hive.metastore.uris", "thrift://METASTORE:9083");
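
Put together, a minimal sketch of the sequence in spark-shell (using the placeholder host name from the question, and setting the URI before the first query, since the metastore connection is only established on first use):

scala> // point at the remote metastore before touching any table
scala> sqlContext.setConf("hive.metastore.uris", "thrift://ec2-instance-external-dbs-name:9083")
scala> sqlContext.sql("select * from web_log").show()
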
  • It helped to get out of the error posted in the question. Now I set hive.metastore.uris in the hiveContext. But now I am getting this error: ```java.net.ConnectException: Call From delhi/127.0.1.1 to localhost:9000 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)```. It seems Spark is trying to access HDFS using localhost. I've tried setting ```fs.defaultFS``` in the hiveContext, but no use. Please help me – sag Nov 22 '16 at 07:43
  • The above error is because you are not able to connect to the specified host – Ram Ghadiyaram Nov 22 '16 at 17:04
  • But why is it trying to access localhost even though I set the metastore URIs to the EC2 instance? How can I configure it to use the EC2 instance host? – sag Nov 23 '16 at 05:08
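
On that last comment, a hedged guess rather than a confirmed fix: the metastore hands back each table's location as a fully qualified URI, so if the table was created while the EC2 instance's core-site.xml had fs.defaultFS set to hdfs://localhost:9000, the stored location itself contains localhost and the client's own fs.defaultFS cannot override it. The stored location can be checked from spark-shell:

scala> sqlContext.sql("describe formatted web_log").collect().foreach(println)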