So I have a Scala program that iterates through a graph and writes out data line by line to a text file. It is essentially an edge list file for use with GraphX.
The biggest slowdown is actually creating this text file; we're talking maybe a million records it writes out. Is there any way I can parallelize this task, or make it faster, for example by buffering the output in memory first?
More info: I am using a Hadoop cluster to iterate through a graph, and here is the code snippet I'm currently using to create the text file and write it to HDFS:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Build a unique HDFS path for the edge list file
val fileName = dbPropertiesFile + "-edgelist-" + System.currentTimeMillis()
val path = new Path("/home/user/graph/" + fileName + ".txt")

// Connect to HDFS and open an output stream for the new file
val conf = new Configuration()
conf.set("fs.defaultFS", "hdfs://host001:8020")
val fs = FileSystem.newInstance(conf)
val os = fs.create(path)

// Write one "inVertexId outVertexId" pair per line
while (edges.hasNext) {
  val current = edges.next()
  os.write(current.inVertex().id().toString.getBytes())
  os.write(" ".getBytes())
  os.write(current.outVertex().id().toString.getBytes())
  os.write("\n".getBytes())
}

// Close the stream before the file system so all data is flushed
os.close()
fs.close()
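
For what it's worth, this is the kind of thing I had in mind by "storing it in memory": wrapping the HDFS stream in a java.io.BufferedOutputStream and writing each edge as one pre-built line, so I'm not hitting the stream four times per record. This is just a rough sketch of the idea, reusing fs, os and edges from the snippet above (the 1 MB buffer size is a guess on my part):

import java.io.BufferedOutputStream
import java.nio.charset.StandardCharsets

// Wrap the HDFS output stream in a large in-memory buffer (1 MB is a guess)
val buffered = new BufferedOutputStream(os, 1 << 20)

while (edges.hasNext) {
  val current = edges.next()
  // Build the whole line once and write it with a single call
  val line = current.inVertex().id().toString + " " + current.outVertex().id().toString + "\n"
  buffered.write(line.getBytes(StandardCharsets.UTF_8))
}

// Closing the buffer flushes it and closes the underlying HDFS stream
buffered.close()
fs.close()

Is that the right direction, or is there a better way to parallelize or speed this up?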