
I have an Excel file in the Hadoop file system (HDFS).

My task is to copy that file from Hadoop to a remote SFTP server in my Scala/Spark application.

I have formed the opinion that a direct transfer will not work. If my fears are correct, I need to take the following steps:

1) Copy the Excel file from Hadoop to a local directory. For example, I can do it with the Scala process DSL:

import scala.sys.process._
s"hdfs dfs -copyToLocal /hadoop_path/file_name.xlsx /local_path/" !

2) Send the file from the local directory to the remote SFTP server. What library can you recommend for this task?

Is my reasoning correct? What is the best way to solve my problem?

  • If you can read the excel in a data frame (maybe this [link](https://stackoverflow.com/questions/44196741/how-to-construct-dataframe-from-a-excel-xls-xlsx-file-in-scala-spark) can be useful) you may want to have a look at this [library](https://index.scala-lang.org/springml/spark-sftp/spark-sftp/1.1.5?target=_2.11). – LizardKing Aug 29 '19 at 13:28
  • @LizardKing in my case the excel file has several sheets. That file has `.xlsx` format. As output of the application, the user should get exactly such a file. Unfortunately, the `spark-sftp` library is not what I need, for several reasons. Do you have any other ideas? – Nurzhan Nogerbek Aug 30 '19 at 07:36

2 Answers


As mentioned in the comment, spark-sftp is a good choice.

If not, you can try the sample code below, based on the Apache Commons Net FTP library, which will list all remote files; similarly, you can delete files as well (see the snippet after the code). It is untested, please try it.

Option 1:

import java.io.IOException

import org.apache.commons.net.ftp.FTPClient

object MyFTPClass {

  def main(args: Array[String]): Unit = {
    // Create an instance of FTPClient
    val ftp: FTPClient = new FTPClient()
    try {
      // Establish a connection with the FTP URL
      ftp.connect("ftp.test.com")
      // Enter user details: user name and password
      val isSuccess: Boolean = ftp.login("user", "password")
      if (isSuccess) {
        // Fetch the list of file names; an empty array is returned
        // when there are no files
        val filesFTP: Array[String] = ftp.listNames()
        var count: Int = 1
        // Iterate over the returned list to print the name of each file
        for (file <- filesFTP) {
          println("File " + count + " : " + file)
          count += 1
        }
      }
      // Log out of the FTP session
      ftp.logout()
    } catch {
      case e: IOException => e.printStackTrace()

    } finally try ftp.disconnect()
    catch {
      case e: IOException => e.printStackTrace()

    }
  }

}
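
For deletion there is FTPClient.deleteFile, which returns whether the removal succeeded; a minimal untested follow-up, assuming the connected `ftp` client from the block above (the remote path is a placeholder):

// Delete a single remote file; returns true on success.
// Assumes the connected `ftp` client from the block above.
val deleted: Boolean = ftp.deleteFile("/remote_path/file_name.xlsx")
println("Deleted: " + deleted)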

Option 2: There is a library called JSch; you can see this question and an example snippet from SO.


Well, I finally found a way to solve the task. I decided to use the JSch library.

build.sbt:

libraryDependencies += "com.jcraft" % "jsch" % "0.1.55"

.scala:

import scala.sys.process._
import com.jcraft.jsch._

// Copy Excel file from Hadoop file system to local directory with Scala DSL.
s"hdfs dfs -copyToLocal /hadoop_path/excel.xlsx /local_path/" !

val jsch = new JSch()
val session = jsch.getSession("XXX", "XXX.XXX.XXX.XXX") // Set your username and host
session.setPassword("XXX") // Set your password
val config = new java.util.Properties()
config.put("StrictHostKeyChecking", "no")
session.setConfig(config)
session.connect()

val channelSftp = session.openChannel("sftp").asInstanceOf[ChannelSftp]
channelSftp.connect()
channelSftp.put("/local_path/excel.xlsx", "sftp_path/") // Set your path on the remote SFTP server
channelSftp.disconnect()

session.disconnect()
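
P.S. As a possible refinement (an untested sketch; host, credentials, and paths are placeholders): `ChannelSftp.put` also accepts an `InputStream`, so in principle the intermediate local copy can be skipped by streaming straight from HDFS via the Hadoop FileSystem API:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import com.jcraft.jsch._

// Open the HDFS file as a stream and upload it directly,
// skipping the local copy.
val fs = FileSystem.get(new Configuration())
val in = fs.open(new Path("/hadoop_path/excel.xlsx"))

val jsch = new JSch()
val session = jsch.getSession("XXX", "XXX.XXX.XXX.XXX") // Placeholder user and host
session.setPassword("XXX") // Placeholder password
val config = new java.util.Properties()
config.put("StrictHostKeyChecking", "no")
session.setConfig(config)
session.connect()

val channelSftp = session.openChannel("sftp").asInstanceOf[ChannelSftp]
channelSftp.connect()
try {
  // ChannelSftp.put(InputStream, String) streams the upload
  channelSftp.put(in, "sftp_path/excel.xlsx")
} finally {
  in.close()
  channelSftp.disconnect()
  session.disconnect()
}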