I know I can read a local file in Scala like so:

import scala.io.Source

val filename = "laba01/ml-100k/u.data"

for(line <- Source.fromFile(filename).getLines){
    println(line)
}

This code works fine and prints the lines of the text file. I run it in JupyterHub with Apache Toree.

I know I can read from HDFS on this server, because when I run the following code in another cell:

import sys.process._
"hdfs dfs -ls /labs/laba01/ml-100k/u.data"!

it works fine too, and I can see this output:

-rw-r--r--   3 hdfs hdfs    1979173 2020-04-20 17:56 /labs/laba01/ml-100k/u.data

lastException: Throwable = null
warning: there was one feature warning; re-run with -feature for details

0

Now I want to read the same file, stored in HDFS, by running this:

import scala.io.Source

val filename = "hdfs:/labs/laba01/ml-100k/u.data"

for(line <- Source.fromFile(filename).getLines){
    println(line)
}

but I get this output instead of the file's lines printed out:

lastException = null

Name: java.io.FileNotFoundException
Message: hdfs:/labs/laba01/ml-100k/u.data (No such file or directory)
StackTrace:   at java.io.FileInputStream.open0(Native Method)
  at java.io.FileInputStream.open(FileInputStream.java:195)
  at java.io.FileInputStream.<init>(FileInputStream.java:138)
  at scala.io.Source$.fromFile(Source.scala:91)
  at scala.io.Source$.fromFile(Source.scala:76)
  at scala.io.Source$.fromFile(Source.scala:54)

So how do I read this text file from HDFS?

Sergey Zakharov
  • Does this answer your question? [Read the data from HDFS using Scala](https://stackoverflow.com/questions/41587931/read-the-data-from-hdfs-using-scala) – mazaneicha May 30 '20 at 13:43
  • @mazaneicha no, because 1) it doesn't work and it's kinda old to ask for more explanations there (but I'll try). There's some `URI` object there which leads to `Unknown Error` when I run that code. 2) It needs some host and port, which I don't really care about since I am able to access `HDFS` from the very same server I run my `Scala` code on. – Sergey Zakharov May 30 '20 at 14:14
  • This is HDFS NameNode's host and port, and you need them to access HDFS file system. – mazaneicha May 30 '20 at 14:26

1 Answer

scala.io will not be able to find any file in HDFS; it isn't meant for that. If I'm not mistaken, it can only read files on your local file system (file:///).

You need to use hadoop-common.jar to read the data from HDFS.

You can find a code example here: https://stackoverflow.com/a/41616512/7857701
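
For reference, here is a minimal sketch of that approach. It assumes hadoop-common (and the HDFS client jars) are on the notebook's classpath and that the cluster configuration on the server defines a default HDFS NameNode. `readLines` is a hypothetical helper name, not part of any library. The `fsUri` argument can be `hdfs:///` to use the default NameNode from the configuration, or `hdfs://host:port` to name one explicitly.

```scala
import java.net.URI
import scala.io.Source
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical helper: read all lines of a file through the Hadoop FileSystem API.
// fsUri selects the file system, e.g. "hdfs:///" (default NameNode from the
// cluster configuration) or "hdfs://host:port" (explicit NameNode).
def readLines(fsUri: String, path: String): List[String] = {
  // FileSystem.get returns a cached, shared instance, so only the stream is closed here.
  val fs = FileSystem.get(new URI(fsUri), new Configuration())
  val in = fs.open(new Path(path))
  try {
    // Materialize the lines before the stream is closed.
    Source.fromInputStream(in, "UTF-8").getLines().toList
  } finally {
    in.close()
  }
}
```

With the file from the question, `readLines("hdfs:///", "/labs/laba01/ml-100k/u.data").foreach(println)` should print its lines, provided the default file system in the configuration is HDFS. The whole file is loaded into memory, which is fine for a file of about 2 MB but not for very large ones.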

Snigdhajyoti
  • That code doesn't work and it's kinda old to ask for more explanations there (but I'll try). There's some URI object there which leads to Unknown Error when I run that code. And it needs some host and port, which I don't really care about since I am able to access HDFS from the very same server I run my Scala code on. – Sergey Zakharov May 30 '20 at 14:16
  • 1
    Yes it needs nameserver host and port. OR If you have the service running in the same node put `hdfs:///` (mind the 3 `/`) in the URI – Snigdhajyoti May 30 '20 at 14:19
  • What about `URI` in `new URI`? What is it and how do I import it? – Sergey Zakharov May 30 '20 at 14:24
  • thank you so much! `hdfs:///` did the trick! I imported `URI` as `import java.net.URI` – Sergey Zakharov May 30 '20 at 14:44
  • Yes, that's the one. If your NameNode is on another node, you have to specify host:port. – Snigdhajyoti May 30 '20 at 14:46