I want to read from an HDFS partition one record at a time, sequentially. I found a sample Java snippet that handles this logic. Is there a way to achieve this using PySpark/Python?
Sample Java snippet below (note the while loop):
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// ...
Configuration conf = new Configuration();
FileSystem fileSystem = FileSystem.get(conf);
Path path = new Path("/path/file1.txt");
if (!fileSystem.exists(path)) {
    System.out.println("File does not exist");
    return;
}
FSDataInputStream in = fileSystem.open(path);
byte[] b = new byte[4096]; // read buffer
int numBytes = 0;
while ((numBytes = in.read(b)) > 0) {
    // code to manipulate the data which is read
    System.out.println(new String(b, 0, numBytes));
}
in.close();
fileSystem.close();
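For context, PySpark itself reads files through its RDD/DataFrame APIs rather than exposing a raw sequential stream, so the closest equivalent is to use an HDFS client library from plain Python. Below is a minimal sketch using pyarrow's HadoopFileSystem; the library choice, the "default" connection (which picks up fs.defaultFS from core-site.xml), and the 4096-byte chunk size are my assumptions, and alternatives such as the hdfs (WebHDFS) or pydoop packages would work similarly. Note that pyarrow's HDFS support needs libhdfs and a configured Hadoop/Java environment on the client machine.

from pyarrow import fs

# Connect using the cluster's default configuration; "default" is an
# assumption -- replace with your namenode host/port if needed.
hdfs = fs.HadoopFileSystem(host="default")

path = "/path/file1.txt"
if hdfs.get_file_info(path).type == fs.FileType.NotFound:
    print("File does not exist")
else:
    with hdfs.open_input_stream(path) as stream:
        while True:
            chunk = stream.read(4096)  # read up to 4096 bytes, like the Java buffer
            if not chunk:              # empty bytes means end of file
                break
            # code to manipulate the data which is read
            print(chunk.decode("utf-8", errors="replace"), end="")

This mirrors the Java while loop: each read() call returns at most one buffer's worth of bytes, and an empty result signals end of file, so you can process the file sequentially without loading it all into memory.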