
So I have a Scala program that iterates through a graph and writes out data line by line to a text file. It is essentially an edge list file for use with GraphX.

The biggest slowdown is actually creating this text file; we're talking maybe a million records that it writes out. Is there a way I can parallelize this task or make it faster in any way, perhaps by storing the data in memory first?

More info: I am using a Hadoop cluster to iterate through a graph, and here is the code snippet I'm currently using to create the text file and write it to HDFS:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fileName = dbPropertiesFile + "-edgelist-" + System.currentTimeMillis()
val path = new Path("/home/user/graph/" + fileName + ".txt")
val conf = new Configuration()
conf.set("fs.defaultFS", "hdfs://host001:8020")

val fs = FileSystem.newInstance(conf)
val os = fs.create(path)
while (edges.hasNext) {
  val current = edges.next()
  os.write(current.inVertex().id().toString.getBytes())
  os.write(" ".getBytes())
  os.write(current.outVertex().id().toString.getBytes())
  os.write("\n".getBytes())
}
os.close() // flush and close the stream before closing the FileSystem
fs.close()
  • Have a look at parallel file processing in Scala using Akka: https://stackoverflow.com/questions/11576439/parallel-file-processing-in-scala – Indrajit Swain Oct 04 '17 at 05:54
  • Do you want to write it to a local file system? Or are you on a cluster with HDFS etc.? – jojo_Berlin Oct 04 '17 at 06:33
  • I'm voting to close this question as it is too broad to answer! – eliasah Oct 04 '17 at 06:46
  • I am using a Hadoop cluster, but I am open to writing a file to a local filesystem if that proves to be faster. – user3750667 Oct 04 '17 at 08:17
  • You probably can narrow your question. Can you measure what is actually slow? Do you write to a compressed format (e.g. GZIP? What about switching to Snappy?)? Do you saturate your bandwidth to Hadoop (what if you switched to a BufferedOutputStream to a local text file? What if you compress on the fly?)? What if you manually try to write to several Hadoop files? ... We're sort of in the dark without more details. – GPI Oct 04 '17 at 08:39 (see the buffered-write sketch below)
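
Picking up the BufferedOutputStream idea from GPI's comment, here is a minimal sketch of what buffering the HDFS writes could look like. It assumes the same edges iterator, fs, and path as in the question; the 1 MB buffer size is an arbitrary choice, not a tuned value.

import java.io.BufferedOutputStream
import java.nio.charset.StandardCharsets

// Wrap the raw HDFS stream so that many small writes are batched into fewer large ones.
// 1 << 20 = 1 MB buffer (arbitrary, not tuned).
val out = new BufferedOutputStream(fs.create(path), 1 << 20)
try {
  while (edges.hasNext) {
    val current = edges.next()
    // Build the whole line once instead of issuing four separate write() calls.
    val line = s"${current.inVertex().id()} ${current.outVertex().id()}\n"
    out.write(line.getBytes(StandardCharsets.UTF_8))
  }
} finally {
  out.close() // flushes the buffer and closes the underlying HDFS stream
}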

1 Answer


Writing files to HDFS is never fast. Your tags suggest that you are already using Spark anyway, so you might as well take advantage of it.

    sparkContext
      .makeRDD(edges.toStream, 20) // makeRDD takes (seq, numSlices), not (numSlices, seq)
      .map(e => e.inVertex().id().toString -> e.outVertex().id().toString)
      .toDF("src", "dst") // needs import spark.implicits._ in scope
      .write
      .option("sep", " ") // DataFrameWriter has no .delimiter(); the CSV writer takes a "sep" option
      .csv(path.toString)

This splits your input into 20 partitions (you can control that number with the numeric parameter to makeRDD above) and writes them in parallel as 20 different part files in HDFS, which together make up your resulting output.
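
For reference, here is a slightly fuller sketch of the same idea with the pieces left implicit above spelled out: a SparkSession (assumed to be called spark here), the implicits import that toDF needs, and the string conversion of the vertex ids. The partition count of 20 and the output location are placeholders, not recommendations.

import org.apache.spark.sql.SparkSession

// Assumes `edges` is a scala.collection.Iterator of TinkerPop edges, as in the question;
// toStream materializes it on the driver as the partitions are built.
val spark = SparkSession.builder().appName("edgelist-export").getOrCreate()
import spark.implicits._ // needed for .toDF on an RDD of tuples

val outputDir = "/home/user/graph/" + fileName // Spark writes a directory of part files here

spark.sparkContext
  .makeRDD(edges.toStream, 20) // 20 partitions, written out in parallel
  .map(e => e.inVertex().id().toString -> e.outVertex().id().toString)
  .toDF("src", "dst")
  .write
  .option("sep", " ") // space-separated, matching the original edge-list format
  .csv(outputDir)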

Dima