I have started Spark like this:
spark-shell --master local[10]
I'm trying to see the files on the underlying Hadoop installation.
I want to do something like this:
hdfs ls
How can I do it?
You can execute any underlying system/OS commands (like hdfs dfs -ls, or even pure shell/DOS commands) from Scala (which ships with Spark) just by importing classes from the sys.process package. See the examples below:
import sys.process._
val oldcksum = "cksum oldfile.txt" !!
val newcksum = "cksum newfile.txt" !!
val hdpFiles = "hdfs dfs -ls" !!
import sys.process._  // This lets underlying OS commands be executed.
val oldhash = "certUtil -hashFile PATH_TO_FILE" !!  // certUtil is a Windows command
If you plan to read from and write to HDFS in Spark, you first need to integrate Spark with Hadoop: http://spark.apache.org/docs/latest/configuration.html#inheriting-hadoop-cluster-configuration
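Once that configuration is picked up, hdfs:// paths resolve directly inside spark-shell. A minimal sketch, assuming Spark 2.x (so the spark session object is available); the path is a placeholder to replace with your own file:
val lines = spark.read.textFile("hdfs:///path/to/file.txt")  // placeholder path
lines.show(5)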
If I understand your question correctly, you want to execute HDFS commands from the shell. In my opinion, running a Spark job will not help here.
You need to start your HDFS instance first. Below are the commands from the documentation. Once HDFS is started, you can run the shell commands.
To start a Hadoop cluster you will need to start both the HDFS and YARN cluster.
The first time you bring up HDFS, it must be formatted. Format a new distributed filesystem as hdfs:
[hdfs]$ $HADOOP_PREFIX/bin/hdfs namenode -format
Start the HDFS NameNode with the following command on the designated node as hdfs:
[hdfs]$ $HADOOP_PREFIX/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR --script hdfs start namenode
Start a HDFS DataNode with the following command on each designated node as hdfs:
[hdfs]$ $HADOOP_PREFIX/sbin/hadoop-daemons.sh --config $HADOOP_CONF_DIR --script hdfs start datanode
If etc/hadoop/slaves and ssh trusted access is configured (see Single Node Setup), all of the HDFS processes can be started with a utility script. As hdfs:
[hdfs]$ $HADOOP_PREFIX/sbin/start-dfs.sh
Start the YARN with the following command, run on the designated ResourceManager as yarn:
[yarn]$ $HADOOP_YARN_HOME/sbin/yarn-daemon.sh --config $HADOOP_CONF_DIR start resourcemanager
Run a script to start a NodeManager on each designated host as yarn:
[yarn]$ $HADOOP_YARN_HOME/sbin/yarn-daemons.sh --config $HADOOP_CONF_DIR start nodemanager
Start a standalone WebAppProxy server. Run on the WebAppProxy server as yarn. If multiple servers are used with load balancing it should be run on each of them:
[yarn]$ $HADOOP_YARN_HOME/sbin/yarn-daemon.sh --config $HADOOP_CONF_DIR start proxyserver
If etc/hadoop/slaves and ssh trusted access is configured (see Single Node Setup), all of the YARN processes can be started with a utility script. As yarn:
[yarn]$ $HADOOP_PREFIX/sbin/start-yarn.sh
Start the MapReduce JobHistory Server with the following command, run on the designated server as mapred:
[mapred]$ $HADOOP_PREFIX/sbin/mr-jobhistory-daemon.sh --config $HADOOP_CONF_DIR start historyserver
The second option is the programmatic way: you can use the FileSystem class from Hadoop (a Java implementation) to do the HDFS operations.
Below is the link to the Javadoc:
https://hadoop.apache.org/docs/r2.7.3/api/org/apache/hadoop/fs/FileSystem.html
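For example, a minimal sketch of that approach from spark-shell, assuming the Hadoop configuration is already on the classpath (the folder path is a placeholder):
import org.apache.hadoop.fs.{FileSystem, Path}
// Reuse the Hadoop configuration that the SparkContext already carries.
val fs = FileSystem.get(sc.hadoopConfiguration)
// List the entries under a directory and print their paths.
fs.listStatus(new Path("/path/to/folder")).foreach(status => println(status.getPath))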
You can list the files on the underlying HDFS file system with these commands in spark-shell:
import scala.sys.process._
val lsOutput = Seq("hdfs","dfs","-ls","/path/to/folder").!!
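If you want to process the listing rather than just print it, you can split the captured output into lines, for example (a small sketch):
// Each element corresponds to one line of the "hdfs dfs -ls" output.
val entries = lsOutput.split("\n").filter(_.trim.nonEmpty)
entries.foreach(println)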