
I want to read from an HDFS partition one record at a time, sequentially. I found a sample Java snippet that handles this logic. Is there a way to achieve this using PySpark/Python?

Sample Java snippet below (note the while loop):

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// conf is an org.apache.hadoop.conf.Configuration
FileSystem fileSystem = FileSystem.get(conf);
Path path = new Path("/path/file1.txt");
if (!fileSystem.exists(path)) {
    System.out.println("File does not exist");
    return;
}
FSDataInputStream in = fileSystem.open(path);
byte[] b = new byte[4096]; // read buffer
int numBytes;
while ((numBytes = in.read(b)) > 0) {
    // code to manipulate the data which is read
    System.out.println(new String(b, 0, numBytes));
}
in.close();
fileSystem.close();
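For the Python side, here is a minimal sketch of the same sequential read using pyarrow's HadoopFileSystem (the pyarrow dependency, the "default" host setting, and the 4096-byte chunk size are my assumptions, not part of the original snippet):

from pyarrow import fs

hdfs = fs.HadoopFileSystem(host="default")  # picks up the cluster's Hadoop config
path = "/path/file1.txt"

if hdfs.get_file_info(path).type == fs.FileType.NotFound:
    print("File does not exist")
else:
    with hdfs.open_input_stream(path) as stream:
        while True:
            chunk = stream.read(4096)  # read sequentially, one chunk at a time
            if not chunk:
                break
            # code to manipulate the data which is read
            print(chunk.decode("utf-8", errors="replace"), end="")

Note that pyarrow's HDFS bindings need libhdfs and the Hadoop classpath available on the machine; a node that already runs PySpark typically has both.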
  • Why do you need Spark then? A simple Python program can do that. – dsk Jul 20 '21 at 10:18
  • My project uses PySpark in all its solutions. But I agree a .py script should be enough for this. Updated the heading and question. Please have a look now. – Sudipto Dutta Jul 20 '21 at 11:09
  • https://stackoverflow.com/questions/3277503/how-to-read-a-file-line-by-line-into-a-list <- refer to that; multiple solutions are there. Try exploring the glob module as well (a sketch of that approach follows below). – dsk Jul 20 '21 at 11:14
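A minimal sketch of what the last comment suggests, assuming the partition files are reachable through a local or mounted path (the glob pattern is a placeholder; plain open() works on local paths, not hdfs:// URIs):

import glob

# Read each matching file one line (record) at a time, sequentially.
for filename in sorted(glob.glob("/path/part-*")):
    with open(filename) as f:
        for line in f:
            print(line.rstrip("\n"))  # code to manipulate the data which is read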

0 Answers