A freshly opened file descriptor starts at position 0. If you keep reading from the same fd
in a loop, you'll get successive chunks. (Use a larger buffer like 8 KiB and loop over dwords in user-space, though, using the value that read
returned as an upper limit! A system call is very expensive in CPU time.)
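A minimal C sketch of that pattern, assuming the file is a stream of 32-bit values you want to sum. The helper name sum_dwords is made up for illustration, and EINTR handling is left out:

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: sum every complete 32-bit value readable from fd.
   Returns 0 on success, -1 on a read error. */
int sum_dwords(int fd, uint32_t *sum_out)
{
    unsigned char buf[8192];      /* 8 KiB: one system call per ~2048 dwords */
    size_t have = 0;              /* bytes carried over from the last read */
    uint32_t sum = 0;

    for (;;) {
        ssize_t n = read(fd, buf + have, sizeof buf - have);
        if (n < 0)
            return -1;            /* error (EINTR handling omitted) */
        if (n == 0)
            break;                /* EOF */
        have += (size_t)n;

        /* Only touch bytes that read() actually returned. */
        size_t whole = have & ~(size_t)3;
        for (size_t i = 0; i < whole; i += 4) {
            uint32_t v;
            memcpy(&v, buf + i, sizeof v);   /* alignment-safe load */
            sum += v;
        }

        /* Keep a trailing partial dword for the next iteration. */
        memmove(buf, buf + whole, have - whole);
        have -= whole;
    }
    *sum_out = sum;
    return 0;
}
```

The inner loop runs entirely in user-space, so the cost of each read call is spread over thousands of values instead of being paid once per dword.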
Is it possible to start reading a file from a specific line or byte?
- Byte: yes
- Line: no. In Unix/Linux, the kernel doesn't have an index of line-start byte offsets or any other line-oriented API. The line handling in stdio fgets, for example, is done purely in user-space. There have been some historical OSes with record-based files, but Unix files are flat arrays of bytes. (They can have holes, unwritten extents, and extended attributes... But the kernel APIs for the main file contents only operate by byte offsets.)
If you want to do lines, read a big block and loop forward until you've seen some number of newlines. If you're not there yet, read another block; repeat until you find the start and end of the line number you want, or you hit EOF. x86-64 can efficiently search 16 bytes at a time with pcmpeqb / pmovmskb / popcnt (popcnt requires SSE4.2 or the specific popcnt feature bit).
Or with just SSE2, or when optimizing for large blocks, with pcmpeqb / psadbw (against all-zero) to hsum bytes to qwords / paddd. (Each matching byte is 0xFF after the compare, so those totals are 255 times the newline count.) Then check how many lines you've passed every so often with some scalar code. Or keep it simple and branch on finding the first newline in a SIMD vector.
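As a rough illustration of the pcmpeqb / pmovmskb / popcnt idea, here is a sketch in C with SSE2 intrinsics rather than hand-written asm. The function name count_newlines is hypothetical, __builtin_popcount assumes GCC or Clang, and the scalar loop handles the tail (and the whole buffer on non-SSE2 targets):

```c
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper: count '\n' bytes in p[0..len). */
size_t count_newlines(const unsigned char *p, size_t len)
{
    size_t count = 0, i = 0;

#if defined(__SSE2__)
    const __m128i nl = _mm_set1_epi8('\n');
    for (; i + 16 <= len; i += 16) {
        __m128i v  = _mm_loadu_si128((const __m128i *)(p + i));
        __m128i eq = _mm_cmpeq_epi8(v, nl);               /* pcmpeqb  */
        unsigned mask = (unsigned)_mm_movemask_epi8(eq);  /* pmovmskb */
        /* Compiles to popcnt only if the target enables it
           (e.g. -mpopcnt); otherwise the compiler emits a fallback. */
        count += (size_t)__builtin_popcount(mask);
    }
#endif

    /* Scalar cleanup for the last len % 16 bytes. */
    for (; i < len; i++)
        count += (p[i] == '\n');
    return count;
}
```

To locate a specific line rather than just count, you would stop at the vector where the running count reaches the target and finish with a scalar scan of those 16 bytes.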
Obviously the slow and simple option is a byte-at-a-time loop that counts '\n' characters. If you know how to do strchr with SSE2, it should be straightforward to vectorize that search using the above suggestions.
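The block-at-a-time search described above could look something like this in C. It is only a sketch: find_line_start is a made-up name, line numbers are assumed to be 1-based, the fd is assumed to be positioned at offset 0, and memchr stands in for whatever scalar or SIMD search you prefer:

```c
#define _POSIX_C_SOURCE 200809L
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: byte offset at which 1-based line `target` starts,
   or -1 if the file has fewer lines or a read fails. */
off_t find_line_start(int fd, unsigned long target)
{
    char buf[8192];
    off_t base = 0;              /* file offset of buf[0] */
    unsigned long line = 1;      /* line number of the next unscanned byte */

    if (target <= 1)
        return target == 1 ? 0 : -1;

    for (;;) {
        ssize_t n = read(fd, buf, sizeof buf);
        if (n <= 0)
            return -1;           /* EOF before reaching the line, or error */

        const char *p = buf, *end = buf + n;
        while ((p = memchr(p, '\n', (size_t)(end - p))) != NULL) {
            p++;                 /* first byte after this newline */
            if (++line == target)
                return base + (p - buf);
        }
        base += n;               /* no hit in this block; read the next one */
    }
}
```

Once you have the offset, finding the end of the line is the same search for one more newline. Note that a file ending in '\n' reports the offset just past that newline as the start of one further (empty) line.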
But if you only want some specific byte positions, you have two main options:
- Seek with lseek(2) before read(2) (see @Nicolae Natea's answer).
- Use POSIX/Linux pread(2) to read from a specified offset, without moving the fd's file offset for future read calls. The Linux system call name is pread64 (__NR_pread64 equ 17 from asm/unistd_64.h).
ssize_t pread(int fd, void *buf, size_t count, off_t offset);
The only difference from read is the offset arg. As the 4th arg, it is passed in R10 (not RCX as in the user-space function-calling convention). off_t is a 64-bit type, passed in a single register in 64-bit code.
Other than the pread64 name in the .h, there's nothing special about the asm interface compared to the C interface; it follows the standard system-calling convention. (It has existed since Linux 2.1.60; before that, glibc's wrapper emulated it with lseek.)
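For comparison, here are the two options side by side through the C wrappers. The function names read_at_seek and read_at_pread are invented for this sketch; the underlying calls take the same arguments you would load into registers in asm:

```c
#define _POSIX_C_SOURCE 200809L
#include <sys/types.h>
#include <unistd.h>

/* Option 1: move the fd's file offset, then read from there.
   Two system calls, and later read() calls continue after this data. */
ssize_t read_at_seek(int fd, void *buf, size_t count, off_t offset)
{
    if (lseek(fd, offset, SEEK_SET) == (off_t)-1)
        return -1;
    return read(fd, buf, count);
}

/* Option 2: one system call; the fd's file offset is left untouched. */
ssize_t read_at_pread(int fd, void *buf, size_t count, off_t offset)
{
    return pread(fd, buf, count, offset);
}
```

Like read, both can return fewer bytes than requested (for example near EOF), so the return value is still the upper limit for whatever loop consumes the buffer.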
There are other things you can do, like mmap or a preadv system call, but pread is the most direct fit if you have a known position you want to read from.
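For completeness, a sketch of the mmap alternative in C. The helper name read_at_mmap is hypothetical, and mapping and unmapping the whole file for every call is done here only to keep the example self-contained:

```c
#define _POSIX_C_SOURCE 200809L
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: copy up to `count` bytes starting at `offset`
   out of a read-only mapping of the file. Returns bytes copied, or -1. */
ssize_t read_at_mmap(int fd, void *buf, size_t count, off_t offset)
{
    struct stat st;
    if (fstat(fd, &st) != 0 || offset < 0)
        return -1;
    if (offset >= st.st_size)
        return 0;                        /* at or past EOF */

    size_t size = (size_t)st.st_size;
    unsigned char *map = mmap(NULL, size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (map == MAP_FAILED)
        return -1;

    size_t avail = size - (size_t)offset;
    if (count > avail)
        count = avail;                   /* clamp to what the file holds */
    memcpy(buf, map + offset, count);    /* any byte offset is just an index */

    munmap(map, size);
    return (ssize_t)count;
}
```

A real program would normally keep the mapping around and index into it directly, which is where mmap pays off over repeated pread calls.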