I am relatively new to hadoop ecosystem. My goal is to read hive tables using Apache Spark and process it. Hive is running in EC2 instance. Whereas Spark is running in my local machine.
To do a prototype, I've installed Apache Hadoop by following steps present over here . I've added required environment variables as well.
I've started dfs using $HADOOP_HOME/sbin/start-dfs.sh
I've installed Apache Hive by following steps present over here. I've started hiverserver2 and hive metadatastore. I've configured apache derby db (Server mode) in hive. I've created a sample table 'web_log' and added few rows in it using beeline.
I've added below in hadoop core-site.xml
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
And added below in hdfs-site.xml
<property>
<name>dfs.client.use.datanode.hostname</name>
<value>true</value>
</property>
I've added core-site.xml, hdfs-site.xml and hive-site.xml in $SPARK_HOME/conf in my local spark instance
core-site.xml and hdfs-site.xml are empty. i.e.
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
</configuration>
hive-site.xml has below content
<configuration>
<property>
<name>hive.metastore.uris</name>
<value>thrift://ec2-instance-external-dbs-name:9083</value>
<description>URI for client to contact metastore server</description>
</property>
</configuration>
I've started spark-shell and executed the following command
scala> sqlContext
res0: org.apache.spark.sql.SQLContext = org.apache.spark.sql.hive.HiveContext@57d0c779
It seems spark has created HiveContext. I've executed sql using below command
scala> val df = sqlContext.sql("select * from web_log")
df: org.apache.spark.sql.DataFrame = [viewtime: int, userid: bigint, url: string, referrer: string, ip: string]
The columns and its types matches the sample table 'web_log' that I've created.
Now when I execute scala> df.show
, it took some time and throws below error
16/11/21 18:46:17 WARN BlockReaderFactory: I/O error constructing remote block reader.
org.apache.hadoop.net.ConnectTimeoutException: 60000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=/ec2-instance-private-ip:50010]
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:533)
at org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:3101)
at org.apache.hadoop.hdfs.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:755)
It seems DFSClient is using EC2 instances internal ip. And AFAIK, I didn't start any application on port 50010.
Do I need to install and start any other application?
How can make sure that DFSClient uses EC2 instance external IP or external DNS name?
Is it possible to access hive from external spark instance?